FoundationDB (abbreviated FDB) is Apple's open-source distributed transactional key-value store; you can think of it as similar to TiKV, which PingCAP open-sourced. Apple recently published a paper, FoundationDB: A Distributed Unbundled Transactional KeyValue Store, describing its internal design. This post is my learning notes from reading that paper and the official documentation.
The overall architecture
As shown in the figure above, FDB has a clean architecture whose major modules fall into three parts:
The client interface (Client)
The control plane (Control Plane)
The data plane (Data Plane)
This post focuses on the Control Plane and the Data Plane.
The Control Plane manages the cluster's metadata and uses Active Disk Paxos for high availability. It consists of the following parts:
Several coordinator processes form a Paxos group, with one of them elected leader, called the cluster controller. The cluster controller is responsible for failure detection and for assigning roles to processes, and it aggregates and distributes all kinds of cluster-wide information. It is also the entry point to the whole cluster: a client reaches the cluster through a saved list of coordinator IP:Port addresses, and obtains the latest proxy list from the cluster controller.
The DataDistributor monitors StorageServer failures and rebalances data across StorageServers.
The Ratekeeper provides overload protection by throttling the rate at which monotonically increasing timestamps are handed out.
The Data Plane can be roughly divided into three parts:
The Transaction System implements distributed transactions at the serializable snapshot isolation (SSI) level.
The Log System replicates logs to keep the system highly available.
The Storage System stores the actual data (i.e., the state machine) and asynchronously pulls logs from the Log System to apply. The current single-node storage engine is a modified SQLite.
The Transaction System is the most complex part and can itself be layered into three modules:
The Proxy is the client-facing interface of the transaction system. All transaction requests go through a proxy: obtaining a read version, fetching the storage server locations for key ranges, and committing transactions.
The Sequencer assigns monotonically increasing read versions and commit versions.
The Resolver performs conflict detection for SSI-level transactions.
Log replication is necessary for any distributed database to achieve high availability.
FDB has two kinds of data to replicate: Control Plane metadata and Data Plane user data.
The Control Plane replicates its metadata log with Active Disk Paxos (a Paxos variant). Paxos-based log replication is common industry practice: it is the standard way to achieve high availability and strong consistency without relying on an external arbiter.
The Data Plane uses synchronous replication: each write is copied to f + 1 nodes, and success is returned only after all f + 1 nodes succeed. This tolerates the loss of f copies with only f + 1 replicas, whereas replicating with Paxos/Raft requires 2f + 1 replicas to tolerate f failures.
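The replica-count arithmetic above can be sketched as two tiny functions (purely illustrative; the names are mine, not FDB's):

```python
# Illustrative comparison of the nodes needed to tolerate f failures
# under synchronous replication vs. a Paxos/Raft majority quorum.

def sync_replication_nodes(f):
    # Synchronous replication: write to f + 1 nodes, all must ack.
    # Losing any f of them still leaves one complete copy.
    return f + 1

def majority_quorum_nodes(f):
    # Paxos/Raft: a majority of 2f + 1 nodes must ack, so up to f
    # nodes may be down while a quorum of f + 1 remains reachable.
    return 2 * f + 1

for f in (1, 2):
    print(f, sync_replication_nodes(f), majority_quorum_nodes(f))
```

For f = 2, synchronous replication needs 3 replicas where Paxos/Raft needs 5, which is the cost gap the text refers to.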
At first glance synchronous replication looks much cheaper than Paxos/Raft, but here the Log System's leader election, failure recovery, and failover all depend heavily on the cluster controller for arbitration. Also, the failure of any single replica causes writes to fail, whereas with Paxos/Raft only a leader failure does. In addition, how is membership change done here? That probably requires reading the code. Are there hidden pitfalls? That will probably only become clear after serious use.
The Data Plane's distributed transactions are designed for typical OLTP scenarios: horizontal scalability, read-heavy workloads, and small, short transactions (under 5 seconds). FDB implements the SSI isolation level via OCC + MVCC.
The basic flow of a transaction is as follows:
The client asks a proxy, which asks the sequencer, for a read version.
With that read version, the client reads a data snapshot from the storage servers.
Writes are buffered locally on the client until commit.
Transaction commit:
The client sends its read set and write set to a proxy.
The proxy obtains a commit version from the sequencer.
The proxy sends the read and write sets to the resolvers for conflict detection.
If the transaction conflicts, it is aborted. Otherwise the proxy sends the write set to the log servers for persistence, completing the commit.
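The whole flow above can be condensed into a toy, in-memory sketch (client buffer, sequencer, resolver, and a dict standing in for the storage/log servers). All class and field names here are my own illustration, not FDB's actual code:

```python
# Toy sketch of the transaction flow: begin -> buffered reads/writes
# -> commit with OCC read-write conflict detection.

class Sequencer:
    def __init__(self):
        self.version = 0
    def next_version(self):          # monotonically increasing versions
        self.version += 1
        return self.version

class Resolver:
    def __init__(self):
        self.recent_writes = []      # (commit_version, key) pairs
    def check(self, read_version, commit_version, read_set, write_set):
        # Abort if any key we read was written by a transaction that
        # committed after our snapshot (a read-write conflict).
        for v, key in self.recent_writes:
            if read_version < v <= commit_version and key in read_set:
                return False
        self.recent_writes += [(commit_version, k) for k in write_set]
        return True

class Database:
    def __init__(self):
        self.sequencer = Sequencer()
        self.resolver = Resolver()
        self.storage = {}            # stands in for storage + log servers

    def begin(self):
        return {"read_version": self.sequencer.version,
                "read_set": set(), "writes": {}}

    def get(self, txn, key):
        txn["read_set"].add(key)
        if key in txn["writes"]:     # read-your-writes from local buffer
            return txn["writes"][key]
        return self.storage.get(key)

    def set(self, txn, key, value):  # buffered locally until commit
        txn["writes"][key] = value

    def commit(self, txn):
        commit_version = self.sequencer.next_version()
        ok = self.resolver.check(txn["read_version"], commit_version,
                                 txn["read_set"], set(txn["writes"]))
        if ok:
            self.storage.update(txn["writes"])  # "persist to log servers"
        return ok

db = Database()
t1 = db.begin(); t2 = db.begin()
db.get(t1, "x"); db.set(t1, "x", 1)      # t1 reads then writes x
db.get(t2, "x"); db.set(t2, "x", 2)      # t2 also reads then writes x
print(db.commit(t1), db.commit(t2))      # prints: True False
```

The second commit fails because t2 read "x" at a snapshot older than t1's commit, which is exactly the read-write conflict the resolver is there to catch.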
Given this execution flow, there are a few key questions:
How is the read version decided?
How are transaction conflicts detected to satisfy SSI?
How does the system recover from failure?
How is the read version decided?
By the requirements of SSI, a transaction with a given read version must be able to read every transaction whose commit version is less than or equal to that read version. For already-committed transactions this is not a problem; the difficulty is concurrent transactions.
For example:
Transaction A and transaction B are concurrent: A obtains a read version and is about to read; B obtains a commit version and is about to commit.
If read version < commit version, A must not see B's data in any case, so there is no problem.
If read version >= commit version, A must be able to read B's data, but B may not have finished committing yet...
How is this solved? The traditional approach is for commit to first lock the data, marking a pending transaction. A reader must check whether the data is locked; if it is, a transaction is in the middle of writing, and the reader either waits for it to finish, pushes it forward to commit/abort, or aborts the current transaction.
FDB's approach: collect the latest and largest commit version from all proxies and use it as the read version. Because commit versions are assigned in a globally monotonically increasing order, this guarantees that any concurrent transaction's commit version is larger than the read version.
When a client requests a read version from a proxy, the proxy asks all other proxies for their last commit versions, and checks a set of transaction logs satisfying replication policy are live. Then the proxy returns the maximum commit version as the read version to the client.
Although the proxy's commit-version collection is easy to batch for better performance, every read-version request has to contact all proxies, so I doubt its scalability and availability.
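The rule quoted from the paper boils down to a one-liner; here is a sketch (the function name and its argument are my own illustration):

```python
# Sketch of FDB's read-version rule: a proxy asks all proxies for their
# last commit version and returns the maximum as the read version.

def get_read_version(last_commit_versions):
    """last_commit_versions: the last commit version reported by each proxy."""
    # Any transaction that finished committing before this call went out
    # has its commit version reflected in some proxy's reply, so the max
    # is a safe snapshot point; concurrent transactions will be assigned
    # strictly larger commit versions by the sequencer.
    return max(last_commit_versions)

print(get_read_version([103, 107, 101]))  # prints: 107
```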
How are transaction conflicts detected to satisfy SSI?
FDB's conflict detection algorithm is very simple; the core is to check for read-write conflicts.
As for why checking only read-write conflicts is sufficient for SSI, the paper does not explain. I previously read A Critique of Snapshot Isolation, which systematically introduces Snapshot Isolation and Serializable Snapshot Isolation; the SSI conflict detection algorithm it presents is essentially the same as FDB's. Read it if you are interested.
FDB's transaction implementation is actually similar to Omid's; see the related papers if you are interested:
Omid: Lock-free Transactional Support for Distributed Data Stores
Omid, Reloaded: Scalable and Highly-Available Transaction Processing
A Critique of Snapshot Isolation is, in fact, also an Omid-related paper.
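Since FDB's resolvers check read-write conflicts over key ranges rather than single keys, here is a range-based sketch of the rule (the function names and tuple encoding are mine, not FDB's): a committing transaction aborts if any range it read overlaps a range written by a transaction that committed within (read_version, commit_version].

```python
# Illustrative read-write conflict detection over half-open key ranges.

def ranges_overlap(a, b):
    # Half-open ranges [begin, end) overlap iff each starts before the
    # other ends.
    return a[0] < b[1] and b[0] < a[1]

def has_conflict(read_version, commit_version, read_ranges, committed_writes):
    """committed_writes: list of (version, write_range) tuples kept by
    the resolver for recently committed transactions."""
    for version, write_range in committed_writes:
        if read_version < version <= commit_version:
            if any(ranges_overlap(r, write_range) for r in read_ranges):
                return True
    return False

committed = [(5, ("a", "c")), (8, ("m", "p"))]
# Snapshot at version 4, committing at 9, read keys in ["b", "d"):
print(has_conflict(4, 9, [("b", "d")], committed))   # True: overlaps ("a", "c")
# Read range ["e", "f") touches nothing written concurrently:
print(has_conflict(4, 9, [("e", "f")], committed))   # False
```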
How does the system recover from failure?
FDB's failure recovery design seems strange to me, and I don't fully understand it. The core processes of the transaction system are the Sequencer, Proxies, Resolvers, and LogServers, with the Sequencer acting as the "master controller":
If the Sequencer dies, a new Sequencer is started.
If any other process fails, the Sequencer kills itself, and then a new Sequencer is started.
The new Sequencer obtains the state of the old transaction system from the Coordinators, including all the Proxies, Resolvers, and LogServers, and prevents them from processing new transactions. It then assembles a brand-new transaction system (Proxies, Resolvers, LogServers).
When the transaction system recovers, the key step is determining the position of the latest committed log entry on the LogServers, called the Recovery Version: everything at or before this position has been committed, and everything after it can simply be discarded.
With FDB's synchronous replication, recovery is very simple: collect the maximum persisted LSN from at least m - k + 1 LogServers, then take the smallest one, where:
m is the total number of LogServers;
k is the replication factor of the log.
FDB's actual implementation is a bit more complicated than this, but the principle is the same.
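The recovery-version rule above can be sketched as a small function (illustrative, not FDB's implementation):

```python
# Sketch: query at least m - k + 1 LogServers for their maximum
# persisted LSN and take the minimum of the replies.

def recovery_version(persisted_lsns, m, k):
    """persisted_lsns: max persisted LSN reported by each responding LogServer.
    m: total number of LogServers; k: replication factor of the log."""
    assert len(persisted_lsns) >= m - k + 1, "not enough LogServers responded"
    # Every log entry at or below min(replies) is persisted on all the
    # responding servers; entries above it may be missing somewhere and
    # are discarded during recovery.
    return min(persisted_lsns)

# m = 5 LogServers, replication factor k = 3 -> need at least 3 replies.
print(recovery_version([120, 118, 121], m=5, k=3))  # prints: 118
```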
FDB's design strongly emphasizes layering, module decoupling, and divide-and-conquer, for example the Control Plane / Data Plane split above, each further divided into functional modules.
FDB only provides a few simple interfaces: get(), set(), getRange(), clear(), and commit(). FDB considers itself to have implemented the "lower half" of a database: a distributed, transactional KV store. Any other data model, such as the relational model or the document model, can be implemented on top of it via a layer of stateless services.
FoundationDB focuses on the "lower half" of a database, leaving the rest to its "layers": stateless applications developed on top to provide various data models and other capabilities.
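The "layers" idea can be made concrete with a toy record layer built purely on a KV store's get/set/getRange interface. The in-memory KV class and the '<table>/<id>/<field>' key encoding below are my own stand-ins, not FDB's client API or the Record Layer's actual format:

```python
import json

class KV:                                   # minimal ordered KV store
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get_range(self, begin, end):        # keys in [begin, end), sorted
        return sorted((k, v) for k, v in self.data.items()
                      if begin <= k < end)

class RecordLayer:
    """Stateless layer: stores each record field under its own key."""
    def __init__(self, kv):
        self.kv = kv
    def save(self, table, record_id, record):
        for field, value in record.items():
            self.kv.set(f"{table}/{record_id}/{field}", json.dumps(value))
    def load(self, table, record_id):
        prefix = f"{table}/{record_id}/"
        rows = self.kv.get_range(prefix, prefix + "\xff")
        return {k[len(prefix):]: json.loads(v) for k, v in rows}

layer = RecordLayer(KV())
layer.save("users", "42", {"name": "ada", "age": 36})
print(layer.load("users", "42"))
```

The layer itself keeps no state; everything lives in the KV store, which is what lets such layers scale out as stateless services.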
Here you can refer to an earlier FDB paper: FoundationDB Record Layer: A Multi-Tenant Structured Datastore. The FDB Record Layer builds a relational-database-like model on top of FDB; it is also open source, but it does not seem to support SQL.
Implementing all data models on top of a pure distributed KV store is an ideal architecture, but reality may not be as rosy (mainly because of performance). If you are interested in this topic, take a look at FOUNDATIONDB'S LESSON: A FAST KEY-VALUE STORE IS NOT ENOUGH.
Simulation testing
Testing distributed systems is troublesome but very important. FDB has emphasized testing from the very beginning of its design; thorough testing is the foundation of fast software iteration.
FDB extended C++ with some keywords for "native" asynchronous programming support, called Flow.
Besides making asynchronous programming more convenient, Flow also abstracts all external dependencies, such as the file system and the network, making error injection and scenario simulation easy.
After reading the paper, I realize how little I know. To really understand the details, I will have to read the code...
FoundationDB: A Distributed Unbundled Transactional KeyValue Store: https://www.foundationdb.org/files/fdb-paper.pdf
Official documents : https://apple.github.io/foundationdb/contents.html
FoundationDB Record Layer: A Multi-Tenant Structured Datastore: https://www.foundationdb.org/files/record-layer-paper.pdf
FDB Record Layer: https://github.com/FoundationDB/fdb-record-layer
FOUNDATIONDB'S LESSON: A FAST KEY-VALUE STORE IS NOT ENOUGH: https://www.voltdb.com/blog/2015/04/foundationdbs-lesson-fast-key-value-store-not-enough/