在这里插入图片描述本作品采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。

本作品 (李兆龙博文, 由李兆龙创作)，由李兆龙确认，转载请注明版权。

文章目录

引言
- Percolator
- TiKV中的应用
- Column Family
- 读放大
- Latches
- 缺陷
- 总结

引言

TiKV是Google Spanner的一个开源实现，其作为HTAP（Hybrid Transactional and Analytical Processing）数据库TiDB的行存储引擎，以支持对OLTP（On-Line Transaction Processing）类型负载的更多优化。这篇文章主要讨论TiKV中事务模型Percolator的理论，具体实现方式，以及我认为的缺陷所在。

Percolator

Percolator[1]是Google在构建索引时实现的一种分布式事务的解决方案，本质是在超大规模的数据集，且数据变化多为小范围增量更新的情况下满足业务对于一致性的要求。

Percolator was built specifically for incremental processing and is not intended to supplant existing solutions for most data processing tasks. Computations where the result can’t be broken down into small updates (sorting a file, for example) are better handled by MapReduce. Also, the computation should have strong consistency requirements; otherwise, Bigtable is sufficient. Finally, the computation should be very large in some dimension (total data size, CPU required for transformation, etc.); smaller computations not suited to MapReduce or Bigtable can be handled by traditional DBMSs.

并且在Percolator中引入了Primary的概念，其可以理解为在每一次的事务中选择一个key作为主键，把其当作此次事务的互斥点，这使得事务如果在提交时崩溃的话，这个事务就会错过提交点，而且锁信息也不会被处理。此时可以通过检查primary锁来区分这两种情况，检查primary锁如果其锁信息被修改为write信息，意味这primary已经被提交，此时认为事务已经被提交，这个落后的更新可以commit，否则的话其可以回滚。优点是什么呢，就是全局的事务管理器只需要有一个中心授时的功能就可以了。TiKV中的PD就有TSO的功能，为了保证集群拥有一个单调递增的时间戳。可以最大程度的保证不受单点问题的困扰，当然这个单点还负责调度，路由等功能。

在Percolator的原始论文中的展现形式是在BigTable中支持跨行事务，事务本身提供Repeatable Read级别的隔离性，也就是我们平时说的快照级别隔离[5]。其实就其本质而言，就是在TiKV中实现了一个跨Region的分布式事务，具体的实现形式就是MVCC+2PC。

对于Percolator的细节不必多多赘述，[1]中已经描述的非常清楚了，更多的可以在[2][3][4]中学习。

TiKV中的应用

下图是TiKV的基本架构图：
在这里插入图片描述
因为TiKV的分布式事务应该是允许跨region的，所以2PC中的每一次prewrite可能是分布于不同的机器上的，所以事务部分的代码需要client和server之间配合才可以完成，客户端这部分需要做的事情其实可以大概分为以下几个：

向PD请求事务开始和commit的时间戳
通过事务本身大小计算ttl，并判断其他限制（事务大小等）
把事务通过机器的不同分成不同的batch，并行发送
错误处理（比较麻烦）

server端需要做的就是提供几种不同的原语，供客户端使用，以支持分布式事务，比如：

KvGet：获取某个值最新的数据
KvPrewrite：预写操作
KvCommit：提交操作
KvScan：和Get差不多，不过这里取一个区间，需要跳过多个版本的数据
KvCheckTxnStatus：检查对应key的锁信息（ttl），决定应该rollback还是commit，通常在prewrite失败的时候调用，用以判断下一步是回滚还是继续提交
KvBatchRollback：回滚
KvResolveLock：用于检测当前primarykey的值决定commit还是rollback（提交可能失败，判断这个事务的互斥点）

Column Family

TiKV使用RocksDB作为底层的存储引擎，RocksDB有一项特性叫做Column Family，也就是代码中经常可以见到的CF，其可以看作DB内的namespace，不同的CF拥有不同的LSM-tree，但是它们却共享同一个WAL，这也意味这跨CF原子写入的能力，我们可以简单的把不同的CF当作不同的数据库，但是我们可以跨CF支持原子写入。

维基百科中对其优势的解释如下：

Accessing the data in a distributed data store would be expensive (time-consuming), if it would be saved in form of a table. It would also be inefficient to read all column families that would make up a row in a relational table and put it together to form a row, as the data for it is distributed on a large number of nodes. Therefore, the user accesses only the related information required.
As an example, a relational table could consist of the columns UID, first name, surname, birthdate, gender, etc. In a distributed data store, the same table would be implemented by creating columns families for “UID, first name, surname”, “birthdate, gender”, etc. If one needs only the males that were born between 1950 and 1960, for a query in the relational database, all the table has to be read. In a distributed data store, it suffices to access only the second standard column family, as the rest of information is irrelevant.

如果按照上面的流程学习过Percolator的话，我们就会知道Percolator与事务相关在表中加入了两个column，一个是write，一个是lock，前者记录commit相关的数据，后者记录锁相关的信息。
在这里插入图片描述
那么正常我们就可以把TiKV划分为3个CF，default代表数据域：

const (
    CfDefault string = "default"
    CfWrite   string = "write"
    CfLock    string = "lock"
)

观察TiKV的架构图，我们发现Transaction是建立在MVCC之上的，意味着这里我们要进行大量的检查lock和write的操作，如果把这些信息都放在default和正常的数据组成一行，这里的读放大情况就会比较严重。这也是引入Column Family的原因。

读放大

当然引入column family一定程度上减小了读放大的情况，但是考虑一种情况，即短时间内某个key变更频繁，暂且不考虑事务冲突的概率，这里多版本的key数据就会使得每次读出现读放大情况，这个问题可以通过数据拆分的方式解决，核心的思想也比较简单，本质上就是对某个key的多版本数据建立索引，将一个key变成多个key：

原始数据：

Meta0 (v127 - v0) next: 0

第一次分裂：

Meta0 (v128 - v96) next:1
Meta1 (v95 - v0) next:0

第二次分裂：

Meta0 (v224 - v192) next:2
Meta1 (v95 - v0) next: 0
Meta2 (v191 - v96) next:1

这样读放大的问题就解决了，当然还有一个方法，就是在数据域中我们不把多版本的数据写入一个key，而是对上面提到的三个column采用如下的编码方式：

CF_DEFAULT: (key, start_ts) -> value
CF_LOCK: key -> lock_info
CF_WRITE: (key, commit_ts) -> write_info

这样每次读取就不会出现上面所谓的key多版本数据的读放大情况了，既然采用的(key, start_ts) -> value的格式编码default域的数据，也就意味这我们在使用迭代器进行访问的时候key相关版本的数据是排列在一起的，这里我们显然希望更新的数据排列在前面，这里有一种编码格式称为Memcomparable format[8]。

文档中读其描述如下：

It is important to be able to compare keys without de-serializing and doing type-specific comparisons. It is possible to represent MySQL data in a memcomparable way. There is a number of subtle details to get things right, including: multi-column keys, case-insensitive collations, even more complex collations, multi-column rows with variable length columns, etc.

最大的挑战就是如何在value不确定类型的情况下不反序列化的进行比较。细节可参考[8]

Latches

Latches被使用于事务中出现条件竞争时使用，事务的处理我们需要提供一个Start_ts来标记是哪一个事务，倘若Commit和Rollback在执行时采用同一时间戳，它们可能会在中间获取lock信息和write信息时获取同一份信息，然后都执行写入，此时哪一个成功就是未知的来，所以我们引入Latches解决本地多client的数据竞争问题。

func (l *Latches) WaitForLatches(keysToLatch [][]byte) {
    for {
        wg := l.AcquireLatches(keysToLatch)
        if wg == nil {
            return
        }
        wg.Wait()
    }
}

func (l *Latches) AcquireLatches(keysToLatch [][]byte) *sync.WaitGroup {
    l.latchGuard.Lock()
    defer l.latchGuard.Unlock()

    // Check none of the keys we want to write are locked.
    for _, key := range keysToLatch {
        if latchWg, ok := l.latchMap[string(key)]; ok {
            // 操作的key哪些已经被加锁了就返回对方的Wg等待
            return latchWg
        }
    }

    // All Latches are available, lock them all with a new wait group.
    wg := new(sync.WaitGroup)
    wg.Add(1)
    for _, key := range keysToLatch {
        l.latchMap[string(key)] = wg
    }

    return nil
}

func (l *Latches) ReleaseLatches(keysToUnlatch [][]byte) {
    l.latchGuard.Lock()
    defer l.latchGuard.Unlock()

    first := true
    for _, key := range keysToUnlatch {
        if first {
            wg := l.latchMap[string(key)]
            wg.Done()
            first = false
        }
        delete(l.latchMap, string(key))
    }
}

这样可以在本地操作出现冲突时等待，以避免脏数据。

缺陷

我们前面提到Tikv-client会在发送请求是统计事务的大小，超过限制后拒绝执行，本质还是为了降低冲突的概率，我们可以在prewrite中看到事务出现任意一个key冲突时都会报错以回滚事务，所以原始论文中提到Percolator适合于大规模的小计算任务，但是TiKV中设计的事务是否满足这种负载暂且不确定，大量冲突的事务会使得性能的急剧下降，其毕竟作为TiDB的行存储引擎，性能与实际负载也是有很大关系的。

[3]中提到TiKV 在存储节点本地添加了一个简单的 Scheduler 层，这部分可以参考[9]，在 2PC 读写遇到锁的时候并不是粗暴的直接回滚返回，而是尝试在本地排队等一下，如果超时或者其他异常，再返回客户端重试，减小了网络的开销。

其次我们要确定一点，就是Percolator的性能问题，论文中也有如下描述：

One of the inefficiencies of Percolator relative to a MapReduce-based system is the number of RPCs sent per work-unit. While MapReduce does a single large read to GFS and obtains all of the data for 10s or 100s of web pages, Percolator performs around 50 individual Bigtable operations to process a single document.
One source of additional RPCs occurs during commit. When writing a lock, we must do a read-modify-write operation requiring two Bigtable RPCs: one to read for conflicting locks or writes and another to write the new lock.

MapReduce通常只对GFS执行一个大型的read操作以获取所有需要的数据，而Percolator处理一个文档就需要执行大约50个单独的Bigtable操作。导致RPC太多的其中一个因素发生在commit期间。当写入一个锁时就需要两个Bigtable的RPC：一个为查询冲突锁或写记录，另一个来写入新锁。

总结

基本上事务的过程可以用下图概述：
在这里插入图片描述
显然可以看出是一个典型的乐观事务模型，写入被缓存在Tikv-client中，直到提交时才去判断与其他的事务是否出现冲突，冲突的判断其实就是使用start_ts时间戳确定一个事务，再去与lock和write中的数据做比较，以此判断冲突。

参考：

原文链接: https://www.cnblogs.com/lizhaolong/p/16437169.html

欢迎关注

微信关注下方公众号，第一时间获取干货硬货；公众号内回复【pdf】免费获取数百本计算机经典书籍;

也有高质量的技术群，里面有嵌入式、搜广推等BAT大佬

原创文章受到原创版权保护。转载请注明出处：https://www.ccppcoding.com/archives/396287

非原创文章文中已经注明原地址，如有侵权，联系删除

关注公众号【高性能架构探索】，第一时间获取最新文章

转载文章受原作者版权保护。转载请注明原作者出处！

TiKV事务模型探究：Percolator

文章目录

引言

Percolator

TiKV中的应用

Column Family

读放大

Latches

缺陷

总结

相关推荐