With the continuous development of information technology , At the same time, many new business scenarios are created , Database is no exception . In the current big data 、 Cloud computing and other information technology , There are many types of databases .
About the classification of databases , The first classification is , It can be divided according to the business scenario of the database . Usually when we talk about databases , First of all, the database is OLAP still OLTP？
OLAP, Online analytical processing .OLAP The first characteristic is that the amount of data is relatively large , It is generally required that PB Or more data , When the amount of data is large , Sensitive to the cost of storage , There will also be certain requirements for data compression .OLAP The concurrency of the business system will not be particularly high , but OLAP In the scenario, the query is usually more complex , Each query consumes a lot of resources , It will require multiple users to reduce the impact of each other , Isolate resources . There are many similar products , such as :TeraData、SybaseIQ、GreenPlum、HP Vetica、AWS Redshift, And now more popular ClickHouse etc. .
OLTP, Online transaction processing . The amount of online transaction processing data is relatively small , Universal delay requires high concurrency and low delay .OLTP Business systems are our core business systems , Including bank 、 insurance 、 Real time online services like telecommunications , Business characteristics determine that there are some outstanding requirements for disaster recovery capability , Generally speaking, it requires 99.999% Availability of the above . Traditionally speaking , Generally speaking, database means ：Oracle、IBM DB2、Informix、MySQL, as well as PostgreSQL Some of these databases .
In the past two years, a database concept called HTAP, Hybrid transaction processing and online analytical database . The basic idea is to be able to process simultaneously within a single cluster OLAP and OLTP Two types of business .
The database architecture has evolved over the years , There are probably three architectures . The first is a single database , The so-called monomer database is just like what we often mentioned before Oracle、PostgreSQL、MySQL This stand-alone database , A single instance can provide independent services , The primary and secondary machines do it through stream replication HA, This is the traditional architecture .
The second is shared storage architecture , Multiple database instances access one storage at the same time , Data is stored in a dedicated storage device , Here the storage device generally refers to the disk array or similar special storage device ,Oracle RAC It's a typical architecture like this .
The third is no sharing , That's what we often say MPP. Every DN The node stores a piece of data , stay DN There will be another layer of nodes above the nodes , This layer of nodes have different names in different databases , But it works the same way , It's all about receiving business requests , Then distribute it , At the same time, the business request is returned .TeraData、GreenPlum、TDSQL All belong to this kind of Architecture .
MPP The implementation of the architecture can be divided into sharing master And no sharing master 2 class ： Shared master The architecture of , Application connection master Visit , Typical representative products are Greenplum、Netezza. No sharing master There will be multiple peer-to-peer access entry nodes for front-end application connection access , Typical representative products are teradata,hp vertica, TDSQL It also belongs to none master framework .
Compared to sharing Master In terms of Architecture , No sharing Master The structure has several characteristics ：
All nodes are equal You can query or load data through any node There is no performance bottleneck or single point of risk
MPP The architecture of database is related to hadoop/SPARK The difference between big data systems represented by ： Big data technology mainly deals with big data capacity , Generally, you can get to PB And even EB The amount of data at level , Most of them are unstructured / Semi structured data , The scale of the machine can reach tens of thousands , In batch processing 、 There are advantages in stream processing . But there are still deficiencies in the field of correlation analysis , There are few integrable analysis software on the market .
and MPP The main solution is structured data , The stability of 、 Business ACID Strictness 、 Complex data processing 、 Correlation analysis 、 It has traditional advantages in response speed and so on , Its flexible data analysis capabilities , Easy to integrate with various analysis software .MPP Architecture database , The main use is SQL, So it's easy to train users , With a mature talent market . Relatively speaking, big data technology requires high skills of users , The cost of learning will also be higher . Enterprise users can choose their own technology according to their own business characteristics .
One 、 Distributed database TDSQL-PG brief introduction
TDSQL-PG Tencent is based on open source PostgreSQL Independent research and development of a new generation of distributed enterprise class HTAP Database engine , Full compatibility PostgreSQL, Highly compatible Oracle grammar . The product adopts no sharing architecture , Support row column hybrid storage , While providing the processing capacity of large-scale data warehouse, it can also fully support distributed transactions ACID Ability .
TDSQL-PG experience 10 It's hard to polish , It can be divided into the following stages ： The era of single machine , As Tencent big data platform TDW A supplement to , Make up for the lack of small data analysis ability ; As the business grows , stand-alone PostgreSQL The bottleneck of the market is gradually highlighted , Promote the team to launch with good scalability and SQL Compatibility ability of V1 edition , And in 2015 Wechat payment merchant system was launched in Spring Festival . later ,TDSQL-PG Open to the enterprise market , stay V2 Version kernel supports separation of powers , Encryption desensitization and other security features , stay 2018 In 2005, we have achieved many benchmark customer applications such as digital Guangdong ;TDSQL-PG V3 Version positioning HTAP, And in 2019 Annual online PICC The core business of the group . Last year's V5 Version kernel has Oracle Compatibility and read write separation function , And put into production online operator system .
in the light of TDSQL-PG The applicable scenarios of , In two ways ：
** The first is business characteristics ,** If the business meets these characteristics , that TDSQL-PG It's very suitable —— In terms of the amount of data ,OLTP exceed 1T,OLAP The amount of scene data exceeds 5T; The number of concurrent connections exceeds 2000, Peak business 100w/s; Need to expand capabilities online and horizontally ; In addition, it needs to be considered by colleagues OLTP as well as OLAP Of HTAP scene ; And it needs strict distributed transaction guarantee ;
** The second is the business scenario ,**TDSQL-PG stay HTAP scene 、 Geographic Information System , And real-time high concurrency 、 Database localization and other scenarios are also good choices . TDSQL-PG use share nothing Of MPP framework , It has good scalability and performance . Its overall physical architecture is divided into three parts ：
Transaction manager , It mainly provides information about global transactions , Also manage global objects . In addition, the upper layer on the right Coordinator（ Coordinate nodes CN）, It mainly provides business access . Each node in the coordination node is equivalent , In other words, the business accesses any of the three nodes , It's going to get the same results , And access this node , The consistency of transactions can be guaranteed .
The data interaction bus in the middle of the figure organically combines the data of each node of the whole distributed cluster , Responsible for the data interaction of all nodes in the whole cluster . The lower and middle layers are data nodes , Data nodes are places where we actually store data ; Each data node stores a local copy of Local Metadata , At the same time, there are local data fragmentation , All the data pieces are combined to form a complete user data set .
The functions of the management and control system are mainly on the left and the bottom , Responsible for resource allocation of cluster nodes , Monitoring alarm 、 Operation and maintenance management, etc .
Two 、TDSQL-PG HTAP Ability introduction
1. Distributed join Way of execution
A very important problem of distributed database is the query problem , stay MPP Under the framework of each DN The data are incomplete . In order to complete a complete distributed query , The strategy needs to be chosen .
There are two main strategies here ： One is PUSH QUERY, By pushing its query down to DN Node up , SQL stay DN On the implementation , Return the data to CN. Another way is to PULL DATA, That is, by putting DN The data on the node is pulled to CN through CN To do all the calculations . When there's a lot of data , PUSH QUERY It's a lot more efficient ,PULL DATA When the amount of data is small , Efficiency will be better .TDSQL-PG There will be... In both ways , The actual execution will be selected according to the optimizer PUSH QUERY perhaps PULL DATA.
In the business analysis scenario , Usually there will be 2 One or more tables are associated with (join) The logic of , This is in a stand-alone instance in , It's a simple operation , But in cluster mode , Because the data is distributed in 1 In one or more physical nodes , It's also relatively complicated to deal with . In many distributed solutions ,join Will pull the data to a node , Do correlation calculation , This not only consumes a lot of network resources , And it's time consuming .
TDSQL-PG In many ways Distributed join Do efficient calculations . Let's briefly introduce TDSQL-PG Distributed join Implementation principle of .
To introduce the principle , Let's take an example from two tables . among TBL_A There are two columns f1,f2, among f1 For distribution key, TBL_B There are also two columns f1,f2, among f1 Also for its distribution key.
join It can be pushed down to the data node , It can be divided into the following two categories ：
Participate in join Two tables of join key And the distribution of tables key Phase at the same time ,join It can be pushed down to the data node for , For example SQL Participate in join Of key by TBL_A.f1 as well as TBL_B.f1, these two items. key They are also TBL_A、TBL_B The distribution of key, Because the data distribution algorithms of different tables are consistent , So the same id The data of is in the same node , So each node does it separately join The process , Summarize the results .
The association between distribution tables and replication tables , We assume that TBL_A It's big , however TBL_B The table is a small one , Then you can also use TDSQL-PG The replication table feature of , take TBL_B Defined as replication table , that TBL_B There is a complete copy at each data node ,join It can also be pushed down to each data node join Calculation , TDSQL-PG The coordination node completes the summary of the results .
And then we're going to talk about the most complex , That is to say join Scenes that can't be pushed down ： Two tables TBL_A,TBL_B Count According to a lot of data , The statement executed is :select * from TBL_A join TBL_B on TBL_A.f1 = TBL_B.f2; This statement does not satisfy the two situations mentioned above , That is to say B surface join key And Distribution key atypism , And there's no little table .
As mentioned earlier, many distributed solutions pull data to a node , Do correlation calculation ,TDSQL-PG We didn't use this resource consuming and low performance method . that TDSQL-PG How to deal with this situation ?
Before introducing the treatment , Let me introduce you TDSQL-PG Distributed execution principle of . First of all, in the way of execution , The coordination node receives the user's SQL request , According to the collected cluster statistics , Generate the optimal cluster level distributed query plan , And it is distributed to the data nodes participating in the calculation for execution , In other words, the coordination node issues the execution plan , The data node is responsible for executing the plan . In data interaction , Efficient data exchange channels are established between data nodes , Data nodes can exchange data efficiently , The process of data exchange is TDSQL-PG in Data redistribution (data redistribution). With efficient global query plan and data redistribution technology support ,TDSQL-PG It's easy to take advantage of parallel computing , Efficient completion join The process of . And in this case, according to B Tabular f2 Field to redistribute , To complete the entire query .
2. OLTP And OLAP The integration solution of
In order to implement in a cluster at the same time OLTP as well as OLAP Fusion , So we need to do the following ：
OLAP as well as OLTP The business is accessing the same data ; Resource isolation , Guarantee OLAP Business does not affect OLTP Business performance ; Can target OLTP as well as OLAP Different business optimizations . TDSQL-PG In order to achieve OLAP And OLTP The scheme as shown in the figure below is adopted for the integration of , The core ideas are as follows :
In the cluster coordinator Nodes provide OLTP as well as OLAP Two plane views . OLTP The business is running in datanode On the primary node ,OLAP The business is running in datanode The standby node of the node light , The data synchronization between them is carried out by means of stream replication . The kernel optimizer selects the corresponding optimizer according to the plane of the query . OLTP as well as OLAP The storage format is suitable for different loads .
3. Row column hybrid storage and compression
As HTAP System ,TDSQL-PG It supports row column hybrid storage ：
By line storage format , Data is stored in the same way in logical order , All column data in a row is stored on the physical disk in order , The benefits of this format are obvious —— If you access multiple columns of data in a row at the same time , One You only need one disk IO, More suitable OLTP Type of load . Storage format by column , Each column of data in the table is stored as a separate disk file , For example ,“ full name ”, “ department ”,“ Age ”…… The data in each column is a separate data file , This format can save a lot of disk compared with row storage when a few columns in the table need to be accessed at one time IO, It's especially efficient in aggregation scenarios , So it's more used in OLAP Class system .
Row storage is TDSQL-PG The basic storage format of , In order to support efficient OLAP ,TDSQL-PG It also provides a complete mixed storage capacity of column storage , Business can choose the required storage format for the data written into the database according to their own needs .
Column storage module , Let's talk about column storage compression capabilities .
TDSQL-PG There are two types of compression for column storage ：
** The first is lightweight compression .** Lightweight compression first perceives the specific content of the data , So according to the characteristics of the data to choose different compression methods to improve the compression ratio , Lower the cost of the business , At present, we support RLE Compression mode of .
** The second is transparent compression .** This compression method is used directly, including zstd and gzip Direct compression , This kind of compression has no explicit requirement for the storage content of data , Any information can be compressed . Through data compression , The volume of data can be greatly reduced , On the one hand, reduce the cost of users , On the other hand, it can be reduced when a large number of queries are analyzed IO Traffic volume , Improve our query efficiency .
4. Efficient actuators ： Distributed delay materialization
At the execution engine level ,TDSQL-PG After investigating the technology trend and development direction of the industry , Delay materialization is introduced into the engine . As opposed to delayed materialization , It's common to materialize in advance . Materialization in advance is shown in the figure on the left , The projection column and join All the columns are scanned and passed on , This will cause memory loss, especially when the selection rate is very low cache And a lot of invalid data in the network , Increasing the cache miss And reduce the efficiency of network data exchange .
Delayed materialization will take place at the lower level Scan When , Will need join The location information of the column and physical tuple of is transmitted to the upper node . Only when the upper nodes are finished Join After correlation , In the projection stage, the required data information is pulled from the lower layer according to the position information of the records meeting the conditions , And then to the outside , So as to build tuple attributes at least 、 Generate the least number of tuples , In this way, we can greatly reduce cache miss, save CPU Computing overhead and network IO The cost of .
5. Full parallel computing power
The key point for database is undoubtedly high performance computing . The following is an introduction TDSQL PG Work on high speed parallel computing .
TDSQL-PG Full parallel is divided into three levels ：
The first layer is node level parallelism ： The so-called node level parallelism is , After the system gets a query , The query will be distributed to different DN, adopt DN The node level parallelism can be achieved by the query in different areas ;
The second is process level parallelism ： The actuator gets the assignment and parallelizes the operator , That is to use as much as possible and allow more CPU Resources to complete the query work , Pass many CPU Methods to improve the efficiency of the query ;
The third layer is instruction level parallelism ： Including for CPU Special instructions for 、SMD Instruction, etc , By simple arithmetic or evaluation , And improve the query efficiency by optimizing and parallelizing the specified values .
TDSQL-PG Through these three parallel layers to satisfy complex query 、 High performance requirements of real-time computing .
In a distributed scenario , How my data is distributed in the cluster , How to store the data in each node of our distributed cluster ,TDSQL-PG It's called the distribution type of the selection table , The current support 2 Kind of ：
The first is to replicate tables , On the specified node in the cluster , Each node will have a full copy of the data , Such a table is especially suitable for small tables with small changes , It's useful to speed up some related queries .
The second is HASH Distribution table , Is to write data to the storage node hash Break up to different nodes .
however TDSQL-PG Of hash The data distribution of table is different from other databases ： Other database systems generally follow the value of the distribution key hash After the DN node hash The remainder is deposited in the corresponding DN node , however TDSQL-PG Introduced... In data routing shard map, It's not true to take the remainder DN The number of nodes is shard Count （ It's usually 2048 perhaps 4096, Far more than the number of nodes ）, The value calculated in this way is called shardid, Every DN What are the corresponding nodes shardid, We go through shardmap For storage . For example, in this example shard The number is 2048, Yes 2 individual DN node ,DN1 Store odd numbers on sharedid data ,DN2 Even numbers are stored on shardid data . Why do you do this , Because considering a subsequent expansion problem .
6. TDSQL-PG Horizontal expansion application
Data distribution of other database systems , During the expansion, because of the new DN3 node , Systematic node The number has changed , So the way of data routing will also change , All the existing data in the system should be rerouted according to the new routing method hash reblance, When the amount of data is large , This process will take a long time , And this process needs to block the front-end business writing .
TDSQL-PG Because the introduction of shard map, The expansion has added DN3 after , Just put the original DN1 and DN2 Some of the shardid（ In this case, yes s03 s04） Just move the data , Other irrelevant shardid No change , In this process, the front-end business can write normally ,shard Move after copying s03 s04 It can automatically track down the newly written incremental data after the inventory , Refresh when you catch up shard map Routing information for , take s03 s04 Map of to new add DN3 node . Follow up visit to this shard When it comes to information DN3 obtain . This process is completely transparent to the business , It does not affect the normal writing of business . In this way ,TDSQL-PG Flexible expansion can be easily achieved 、 Shrinkage capacity 、 Load balancing , And this process can be completely business insensitive .
In data governance , because TDSQL PG Support range、list 、hash 、 High performance equispaced partition .TDSQL-PG By partitioning tables , Store different sub partitions to different node groups , Different node groups are associated with different machines . In this way, the thermal data can be stored on the configured machine , Cold data is stored on a slightly inferior machine , In order to realize the hierarchical storage of hot and cold data , Reduce the storage cost of user data .
3、 ... and 、TDSQL-PG Best practices
1. Wechat payment TDSQL-PG Is in 2015 At the beginning of the year, wechat payment replaced the original sub database and sub table cluster and went online , Support wechat payment from everyday 500 Ten thousand to more than a day 10 Billion pen , Ensure business stability and continuity , Here we use the data governance function .
At the top of the picture is us CLB, It is a load balancing component within Tencent .CLB Here's our access node CN. stay DN Data storage is close to 4 Months of data stored in the configured device, the main use is SSD The equipment , Ensure data access performance ,4 Months ago, the data is stored on ordinary equipment, which reduces the storage cost . In addition, we also use the strategy of large and small businesses , Solve the data skew problem of different volume merchants , Effectively ensure the stable operation of the system . After that , It reduces the cost of our whole business to about a quarter , Help businesses reduce costs by three-quarters .
2. The seventh national census
stay 2020 In the seventh national census project , Tencent cloud database TDSQL It supports billions of user data , And the scenario requirements of massive super large tables associated with high concurrency statistical query . From the first day of the seventh national census , System query rate per second （QPS） And it soared to 7 ten thousand , The peak has reached 11 All around . Seven people general project involves multiple super large tables associated with highly concurrent statistical queries , Each table contains more than 20 Billion + Bar record . If one of the forms is used to store the average 50 Ten thousand words of books , Can put down more than 1000 Ten thousand copies , One can't read it all his life . In the project TDSQL-PG Took on a very heavy analysis task , At the same time, it has the ability of real-time data writing and massive data computing and analysis .