Blockchain meets distributed database
2021-06-04 10:24:39

More than ten years after the birth of Bitcoin, our understanding of blockchain technology is converging on the view that "a blockchain is a distributed database". There is something to this. At its birth, blockchain served only cryptocurrency; now, with the development of smart contracts and consensus technology, it is increasingly used for general-purpose data management. Distributed databases, the forerunners of such systems, have been developing for decades, and as a rising star blockchain inevitably has to learn from its predecessors' experience, so it is no wonder that it seems to say "I'll be you when I grow up". However, there are subtle differences between blockchains and distributed databases. Pinning these differences down helps us understand both technologies and anticipate where the new one is heading. This article uses a recent paper from this year's SIGMOD, a top database conference [1], as its framework to discuss the similarities and differences between blockchains and distributed databases, and how the two have converged over the past decade.

Blockchain

Blockchain was originally used only for cryptocurrencies, such as Bitcoin and the currencies derived from it. From a data-structure perspective, a blockchain is a sequence of blocks linked by block hashes (every block except the genesis block stores the hash of its parent), and each block holds a batch of transactions. Together these blocks form a ledger that records the entire transaction history since the very first transaction. From a distributed-systems perspective, blockchain solves the Byzantine consensus problem (consensus in the presence of malicious nodes) in an open network. For more on this, see my previous article: Blockchain and distributed systems
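The hash-linked structure described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular blockchain encodes its blocks: the `Block` class, the JSON serialization, and the helper names `append_block` and `is_valid` are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    parent_hash: Optional[str]  # None only for the genesis block
    transactions: tuple         # the batch of transactions held by this block

    def hash(self) -> str:
        # Hash a canonical serialization of the block contents.
        payload = json.dumps(
            {"parent": self.parent_hash, "txs": list(self.transactions)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[Block], transactions: list) -> None:
    """Append a block that stores the hash of the current last block."""
    parent = chain[-1].hash() if chain else None
    chain.append(Block(parent_hash=parent, transactions=tuple(transactions)))


def is_valid(chain: List[Block]) -> bool:
    """Check that every block points at the hash of its predecessor."""
    for i, block in enumerate(chain):
        expected = None if i == 0 else chain[i - 1].hash()
        if block.parent_hash != expected:
            return False
    return True


chain: List[Block] = []
append_block(chain, ["coinbase->alice:50"])
append_block(chain, ["alice->bob:10"])
append_block(chain, ["bob->carol:5"])
```

Because each block commits to its parent's hash, rewriting an old transaction changes that block's hash and breaks the link stored in the next block, which is what makes the history tamper-evident.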

In 2014, the emergence of Ethereum brought smart contracts to blockchain. With smart contracts, blockchain applications are no longer limited to cryptocurrency; they can also support Turing-complete computation, which is steadily turning the blockchain into a general-purpose decentralized computing platform.

The development of blockchain technology has also given a second spring to Byzantine consensus, a research area that predates blockchain by about ten years. Although both focus on consensus in the presence of malicious nodes, traditional research such as PBFT (Practical Byzantine Fault Tolerance) is limited to closed networks: the identities of the nodes are known, and joining or leaving requires a permission-control mechanism. To both unify and distinguish the two technologies, blockchains in the style of Bitcoin and Ethereum came to be called permissionless blockchains (commonly known as public chains), while blockchains based on PBFT or derived BFT protocols are called permissioned blockchains (commonly known as consortium chains). Consortium chains need strong access control, so they are better suited to enterprise scenarios such as banks and securities institutions. The figure in the original article shows the trade-off between security and performance among public chains, consortium chains, and traditional distributed databases. Public chains make the weakest trust assumptions and therefore offer the strongest security (there is no requirement on node identity), but their performance is poor. Distributed databases are the other extreme: the security assumption is strong (there is a trusted center), so consensus performance is high. Consortium chains sit in between: nodes need not trust each other, but the security assumption requires that node identities be known and that at least a certain fraction of nodes behave correctly, and performance is somewhat worse than that of a traditional distributed database.

Distributed database

As a foundational technology in computing, databases have been around for decades. With the growing demand for big-data processing, databases have moved toward distributed designs, and the Internet applications we use every day could not run without large-scale distributed databases. Under this trend, traditional relational databases could not meet scalability requirements, so new kinds of systems such as NoSQL and NewSQL emerged.

With its more flexible data model, NoSQL tends to favor availability over consistency. NoSQL systems offer a range of consistency levels, and different levels lead to different performance, so users can trade off performance against consistency according to their actual use case. This design is more flexible, but it pushes complexity up into the application layer. NewSQL therefore emerged as a design that sits between relational databases and NoSQL: it keeps the relational data model and solid support for ACID semantics while retaining a degree of scalability.

Blockchain vs. Distributed database

There is no essential difference between a blockchain and a distributed database. Going further, I think blockchain simply extends the application scenarios of distributed databases and uses a set of existing techniques to address the challenges that the new scenario brings. That new scenario is how nodes spanning multiple trust domains can manage data effectively. In a traditional distributed database, all nodes are managed by a centralized coordinator and can trust each other unconditionally, so the database only has to deal with node crashes. But when that coordinator is absent, or when the participants are managed by different coordinators, the problem becomes harder: nodes may not only crash but also exhibit all kinds of Byzantine behavior.

To meet the challenges of this new scenario, blockchain has to adopt more conservative techniques, at some cost in performance; the figure in the original article gives an overview. Below we look at how blockchains differ from traditional distributed databases in their technical choices along four dimensions: replication, concurrency, storage, and sharding.

Replication

Replicating data is the most direct and effective way to guard against node failures. However, replication raises a serious problem: data consistency. A classic way to solve the consistency problem is state machine replication (SMR): all nodes start in the same state and maintain the same transaction log, so as long as every node executes the transactions in the same order, every node ends up in the same state.
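To make the SMR idea concrete, here is a minimal sketch of a deterministic state machine replayed over a log. The account-transfer model, the `Replica` class, and the `replay` helper are assumptions made for illustration; they do not correspond to any particular system.

```python
from typing import Dict, List, Tuple

# A transaction is (sender, receiver, amount); this format is illustrative only.
Transaction = Tuple[str, str, int]


class Replica:
    def __init__(self, initial: Dict[str, int]) -> None:
        self.state = dict(initial)  # every replica starts from the same state

    def apply(self, tx: Transaction) -> None:
        """Deterministically apply one transaction; skip it if funds are short."""
        sender, receiver, amount = tx
        if self.state.get(sender, 0) >= amount:
            self.state[sender] -= amount
            self.state[receiver] = self.state.get(receiver, 0) + amount


def replay(initial: Dict[str, int], log: List[Transaction]) -> Dict[str, int]:
    """Run a fresh replica over the log and return its final state."""
    replica = Replica(initial)
    for tx in log:
        replica.apply(tx)
    return replica.state


initial = {"alice": 10, "bob": 0}
log = [("alice", "bob", 10), ("bob", "carol", 5)]

state_a = replay(initial, log)                   # replica A
state_b = replay(initial, log)                   # replica B, same order
state_c = replay(initial, list(reversed(log)))   # a replica that saw another order
```

Replicas A and B agree because they apply the same log in the same order. Replica C diverges: bob's payment is rejected when it comes first, because bob has no funds yet. This is why agreeing on the order of the log is the heart of the problem.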

Consensus algorithms are the key technique for implementing SMR: they guarantee the consistency (safety) and liveness of data across nodes under given network and fault-tolerance assumptions. Different consensus algorithms suit different network and fault-tolerance assumptions. A traditional distributed database only needs to tolerate node crashes, so it mainly uses algorithms such as Raft and Paxos. A blockchain, however, has to tolerate Byzantine behavior, so it must adopt more expensive algorithms such as PBFT and PoW.
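One way to see part of the extra cost is to compare how many replicas each fault model needs. The sketch below uses the standard bounds: crash-tolerant protocols such as Raft and Paxos need n >= 2f + 1 replicas to tolerate f crashes, while PBFT-style protocols need n >= 3f + 1 replicas to tolerate f Byzantine nodes. The function names are my own, and PoW is not covered because its security is stated in terms of hash power rather than a node count.

```python
def max_crash_faults(n: int) -> int:
    """Crashes tolerated by n replicas under a majority-quorum protocol."""
    return (n - 1) // 2


def max_byzantine_faults(n: int) -> int:
    """Byzantine nodes tolerated by n replicas under a PBFT-style protocol."""
    return (n - 1) // 3


def min_replicas_crash(f: int) -> int:
    """Smallest cluster that tolerates f crash faults."""
    return 2 * f + 1


def min_replicas_byzantine(f: int) -> int:
    """Smallest cluster that tolerates f Byzantine faults."""
    return 3 * f + 1


for n in (3, 4, 7, 10):
    print(n, max_crash_faults(n), max_byzantine_faults(n))
```

With the same number of machines, a Byzantine fault-tolerant system survives fewer faulty nodes, and that is before counting the additional message rounds and signature checks such protocols typically need.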

Beyond consensus algorithms, blockchains and distributed databases also differ in the granularity at which they replicate, as the figure in the original article shows. A distributed database (b) can rely on a centralized coordinator, so before replication the coordinator can split a transaction into finer-grained operations and distribute them to different nodes for replication. The transaction itself does not need to be copied to every node, and a node that executes an operation does not know the logic of the original transaction. A blockchain (a), by contrast, has no trusted center, so replication is usually done at the transaction level, and every node then executes all the operations contained in the transaction.

Concurrency

To improve system throughput, processing multiple transactions or operations in parallel is one of the most important techniques in the database field. Because different transactions may operate on the same data object, how to keep execution correct under parallelism, that is, concurrency control, has long been a research hotspot in databases. The goal of concurrency control is to give transaction execution a certain degree of isolation: a transaction running in parallel should not perceive the existence of other transactions. Trading off performance against correctness yields the different isolation levels, from low to high: Read uncommitted, Read committed, Repeatable read, and Serializable, with performance decreasing accordingly. Production-grade databases generally offer several isolation levels.
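A small simulation shows what goes wrong without isolation. It replays two hand-written schedules of reads and writes against a single counter; the schedule format and the `run` helper are assumptions made for this example, not part of any real database API.

```python
from typing import List, Optional, Tuple

# One step is (transaction id, operation, delta); delta is used only by writes.
Step = Tuple[str, str, Optional[int]]


def run(schedule: List[Step], initial: int) -> int:
    """Execute a schedule of reads and writes on one shared value x."""
    x = initial
    local = {}  # the value each transaction last read
    for txn, op, delta in schedule:
        if op == "read":
            local[txn] = x
        elif op == "write":
            # Write back the value read earlier plus this transaction's delta.
            x = local[txn] + delta
    return x


# T1 adds 10 and T2 adds 20 to a balance of 100.
serial = [
    ("T1", "read", None), ("T1", "write", 10),
    ("T2", "read", None), ("T2", "write", 20),
]
interleaved = [
    ("T1", "read", None), ("T2", "read", None),
    ("T1", "write", 10), ("T2", "write", 20),
]

print(run(serial, 100))       # 130
print(run(interleaved, 100))  # 120: T1's update is lost
```

The interleaved schedule is not equivalent to any serial order, so a Serializable database would have to prevent it, for example by locking or by aborting one transaction. Weaker isolation levels allow more interleavings like this in exchange for performance.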

In most existing blockchains, transactions are still executed serially. One reason blockchains support parallelism poorly is that in some of them the execution layer is not yet the bottleneck. In Bitcoin, for example, executing a block takes milliseconds, which is negligible next to the 10-minute block interval. In addition, in blockchains that support smart contracts, transactions often share contract state, and serial execution is often the simplest and safest way to keep transaction results deterministic.

Storage

As we all know, a blockchain is an append-only ledger that contains the entire transaction history from the genesis block to the latest block. As a result, many mainstream blockchains need more than 100 GB of storage. To support integrity verification, blockchains generally use Merkle-tree-like data structures to store the transactions in a block; Ethereum, for example, uses a Merkle Patricia Trie (MPT) to store the state of all accounts. In most databases, by contrast, unless there is a specific provenance requirement, users can generally access only the latest data. Historical data is kept in a log for a period of time for failure recovery, but it is usually cleaned up regularly to save storage space. Moreover, because distributed databases care more about performance, they optimize their indexes for the characteristics of the hardware. For example, data on disk is usually stored in a B+ tree, while in memory, structures that are friendlier to multi-core parallelism and caches, such as FAST or PSL, are used.
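To illustrate the verification that a Merkle tree enables, here is a minimal sketch of a plain binary Merkle tree with inclusion proofs. It is a simplification: it duplicates the last node on odd-sized levels, it is not Ethereum's Merkle Patricia Trie, and the function names are my own.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: List[bytes]) -> List[bytes]:
    """Hash adjacent pairs; the caller guarantees an even number of nodes."""
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: List[bytes]) -> bytes:
    """Root hash of a non-empty list of leaves."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels by repeating the last node
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root, each flagged True if it sits on the left."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1  # the other node of the pair
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root using only the leaf and its proof."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


txs = [b"tx-a", b"tx-b", b"tx-c", b"tx-d", b"tx-e"]
root = merkle_root(txs)
proof = merkle_proof(txs, 2)
print(verify(b"tx-c", proof, root))
```

A client that holds only the root can check that a transaction belongs to a block with a proof whose length grows logarithmically with the number of transactions, instead of downloading the whole block.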

Sharding

Sharding is a key technique for improving the scalability of distributed databases. By spreading data across different shards that are processed separately, a system can scale out: as the number of users and the volume of data grow, the overall throughput of the system ideally grows linearly, and the overhead introduced by sharding itself is almost negligible.

Introducing sharding into a blockchain, however, is not easy. There are two main challenges. The first is how to form the shards. As we know, blockchains need to tolerate Byzantine faults, and this rests on a major premise: a certain proportion of the nodes in the network are honest. PoW, for example, requires that more than 50% of the total hash power be honest, and PBFT requires that more than 2/3 of the nodes be honest. When a blockchain network is sharded, the security assumption must hold in every shard; once it fails in a single shard, the security of the whole system can no longer be guaranteed. Since nodes are usually assigned to shards at random, the total number of nodes must be large enough and the number of shards must not be too large, so that every shard has enough nodes for the security premise to hold.
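The effect of shard size can be estimated with a short calculation. The sketch below assumes a simplified model that is not taken from the paper: a single shard is drawn uniformly at random without replacement from the whole network, and the shard is considered broken when it contains more Byzantine nodes than a PBFT-style protocol tolerates, that is, more than floor((s - 1) / 3) for a shard of size s. The function name and the example numbers are illustrative.

```python
from math import comb


def shard_failure_probability(total: int, byzantine: int, shard_size: int) -> float:
    """Probability that one randomly sampled shard exceeds the BFT threshold.

    Uses the hypergeometric distribution: the shard is a uniform sample of
    shard_size nodes out of total, of which byzantine are malicious.
    """
    tolerated = (shard_size - 1) // 3
    honest = total - byzantine
    bad_outcomes = sum(
        comb(byzantine, k) * comb(honest, shard_size - k)
        for k in range(tolerated + 1, min(byzantine, shard_size) + 1)
    )
    return bad_outcomes / comb(total, shard_size)


# A network of 1000 nodes in which 20% are Byzantine.
for size in (10, 50, 100, 200):
    print(size, shard_failure_probability(1000, 200, size))
```

Even though the network as a whole is well below the 1/3 threshold, a small shard has a noticeable chance of drawing too many malicious nodes, and the risk shrinks quickly as shards get larger. For a fixed network size, this is the reason the number of shards cannot be made arbitrarily large.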

The second challenge is how to guarantee atomicity across shards: a transaction either commits in all the shards involved or aborts in all of them. In a traditional distributed database, this atomicity is generally guaranteed by the two-phase commit (2PC) protocol, which relies on a centralized coordinator to drive it. In a blockchain, however, there is no centralized coordinator, so an additional BFT protocol has to be introduced to coordinate cross-shard transactions, such as the Casper protocol in Ethereum 2.0 [2].
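For reference, the coordinator's role in 2PC can be sketched as follows. This is a minimal in-memory illustration that ignores timeouts, logging, and coordinator failure; the `Participant` class and `two_phase_commit` function are assumptions made for the example.

```python
from typing import List


class Participant:
    """One shard taking part in a distributed transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local part of the transaction can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Trusted coordinator: commit everywhere only if every shard votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                   # phase 2: global commit
            p.commit()
        return True
    for p in participants:                       # phase 2: global abort
        p.abort()
    return False


ok_shards = [Participant("shard-1"), Participant("shard-2")]
bad_shards = [Participant("shard-1"), Participant("shard-2", can_commit=False)]
print(two_phase_commit(ok_shards), two_phase_commit(bad_shards))
```

Every participant simply follows the coordinator's final decision. That is acceptable when the coordinator is trusted, but in a blockchain a malicious coordinator could tell different shards different outcomes, which is why the coordination itself has to be run through a BFT protocol.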

Towards integration

As blockchain technology is gradually deployed in practice, both industry and academia are working to improve its performance, and borrowing mature techniques from distributed databases is the simplest and safest route. For example, BlockchainDB [3] and FalconDB [4] add database features on top of a blockchain system, so that mutually untrusting parties can jointly maintain a verifiable database.

Conversely, some of blockchain's security properties appeal to database designers, so some newer databases that pursue stronger security also carry blockchain genes. For example, Blockchain Relational Database [5] is a new relational database built on PostgreSQL that incorporates the decentralization and traceability of blockchain.

References

[1] Blockchains vs. Distributed Databases: Dichotomy and Fusion

[2] Casper: ethereum/casper

[3] BlockchainDB - A Shared Database on Blockchains: http://www.vldb.org/pvldb/vol12/p1597-el-hindi.pdf

[4] FalconDB: Blockchain-based Collaborative Database: http://www.cs.utah.edu/~lifeifei/papers/falcondb.pdf

[5] Blockchain Meets Database: Design and Implementation of a Blockchain Relational Database: http://www.vldb.org/pvldb/vol12/p1539-nathan.pdf

This article was first published at: https://zhuanlan.zhihu.com/p/372787705

Please include the original link when reprinting. Thanks.