Write it at the front
The background data storage of wechat evolves with the characteristics of wechat products , After several architectural changes , It is now a mature large-scale distributed storage system , Orderly management by thousands of heterogeneous machine cluster , To support billions of visits a day 、 Key value and PB Level of data .
As a mobile social application based on mobile phone , The data generated by most businesses in wechat has commonness ： Data key with time stamp information , And single user data is constantly generated over time . We call this kind of data time series based data . For example, publishing in the circle of friends , Or the data generated by business such as Bill flow of mobile payment can meet such characteristics . Data based on time series are naturally with distinct properties of heat and cold ―― This is determined by the physical characteristics of the mobile phone , Its limited size screen can only display data in separate screens , By sliding your fingers , Smooth and continuous access along the timeline ―― Usually data generated by , Slowly back to the earlier data . At the same time, businesses such as friend circle are application scenarios of information reading diffusion , This means that the background data they generate has the distinct characteristics of more reading and less writing .
In the actual application scenario of wechat , The main features of such data include ： Large amount of data 、 Large amount of visits 、 High importance . These characteristics are in the actual operation process of the current network , It's a big challenge for us , It mainly includes ：
Large amount of data , High storage capacity required ―― Data based on time series is usually not deleted , But as time goes by , The amount of data reached PB Level , The storage space needed is also increasing ;
Large amount of visits , The festival effect is obvious ―― Data based on time series is often the data generated by hot business , They have a high number of visitors , Basically billions of times a minute . Especially during the holidays , Instant traffic is three to five times more than usual ;
High importance , User perception is obvious , Once the data is lost , Causes the user to be unable to use the product normally , And as a result, the rate of complaints is high .
Expanding storage horizontally through heap machines can naturally meet the above challenges , However, under the premise of tight cost budget , The number of machines is limited . under these circumstances , The hot and cold classification architecture of massive data based on time sequence came into being . This architecture is just to deal with the growing data in the background , In order to make full use of machine resources , Give full play to the principles of various hardware media , Combined with the data, the heat and cold are clear 、 Read more and write less access features and develop and Design . It's based on the idea of data Tiering , According to the difference of data access heat and data amount in different time periods , Customize different service strategies , Expand storage boundaries vertically . Scale out storage is easy to understand , By adding the same type of machines to the original cluster ―― It must involve a round of historical data migration ―― Finally, the new and old machines are balanced , There is no difference between them in the provision of services . Under this scheme , Data flow horizontally , The system treats , Obviously, there is no shelter for thinking according to local conditions . And the architecture of vertically expanding storage provides such an idea ：
For hot data , Less data , But it has a large traffic , We want them to be memory resident, of course , Therefore, the system provides a memory layer with strong consistency guarantee , In response to sudden traffic , It can also be used without involving historical data migration , A separate 、 Dynamically and rapidly expand the memory layer .
For historical data , There is a large stock of data , But the number of visits undertaken is very limited , We certainly don't want to use expensive SSDs to store them , therefore , The system provides a cheap mechanical disk layer , And there is a transparent process of cold data stripping and batch sinking , Extract the historical data from the storage layer to the mechanical disk layer .
Through such a vertical stratification 、 The idea of individual expansion , It provides us with great flexibility , Solve the memory bottleneck faced by the storage layer during the festival , In the long run, it eases the cost pressure for us , Solve the disk capacity bottleneck faced by the storage tier .
Of course, a successful large-scale Distributed systems These are not enough , It must also include data multi copy replication policies and zoning Algorithm etc. , We should also have the ability to deal with the complex current network operation environment . We combine the service characteristics of each layer , A strong data consistency algorithm is developed , For example, the memory layer ensures the complete consistency with the storage layer through version number control , The storage layer passes through Paxos Group Realize multi copy disaster recovery , The mechanical disk layer is guaranteed by serial writing . We also implement our own decentralized data routing algorithm , Ensure even distribution of data and traffic , And ensure that this feature is still true after horizontal expansion .
By working hard as above , Interlocking , Our hot and cold layered architecture of massive data based on time series successfully responded PB Level data 、 Hundreds of billions of access and trillions of key value challenges .
The system design
The hot and cold classification architecture of massive data mentioned in this paper is dedicated to data based on time sequence , They are characterized by ：
a). Data key with time stamp information ;
b). Single user data is constantly generated over time .
The architecture we design is strongly dependent on features a), Each link basically depends on the time stamp in the key value to distribute data or sort data . As for how to generate timestamps in key values 、 Whether the overall situation maintains a uniform time 、 How to maintain is beyond the scope of this paper , Usually this is determined by the business characteristics of the front end and the time server strategy of the back end .
And characteristics b) It ensures the necessity of this architecture 、 practical . If the data size is limited , Take the user's account information for example , It's like a household register in our daily life , It has only one share , Not new for a single user . Then we usually use fixed machine cluster storage , And there are few changes . And what we're dealing with is the user's diary 、 Or write it down , They're generating new data every day .
Let's take an example of a cluster on the current network , This kind of business data has the following characteristics ：
1.、 Large amount of data ,PB Level data , Trillions of keys , And in the process of continuous generation , However, the newly generated data accounts for less than the historical stock data . The figure below shows the percentage of the cluster data in each time period .
2、 Large amount of visits , The peak is billions of visits per minute , Especially during the holidays , The user's enthusiasm can be converted into three to five times of the daily traffic . At the same time, it has distinct heat and cold 、 Read more and write less ( The proportion of reading and writing can even reach 100：1) The interview characteristics of , For example, the visit of doubling during the festival is usually the visit to the new data generated during the festival . The figure below shows the proportion of cluster access in each time period .
3、 High data security requirements , This kind of data is usually user sensitive data , Once lost , The rate of users' complaints is high .
The system consists of three levels , As the picture asks , They are the memory layer 、 Storage layer （ Hot data storage layer ） And the mechanical disk layer （ Cold data storage layer ）. From the timeline , They serve data from hot to cold . As shown in the figure below ：
From the perspective of client , The three floors are all side by side , The client will directly communicate with a machine in a certain layer . The specific difference lies in , The memory layer and the mechanical disk layer are read-only to the client . All writes are written directly from the client to the storage layer . We distribute the decentralized configuration to the client machine , The type of configuration includes memory layer routing 、 Storage layer routing and other metadata , The client is divided according to the time in the configuration and the traffic proportion , To decide whether to distribute the current read request to the memory layer or the specific machine of the storage layer . Configuration supports fast distribution and dynamic loading , Updates can be implemented in seconds .
The routing of the mechanical disk layer is transparent to the client , The storage layer holds the data link that sinks to the mechanical disk layer , The link contains the document number 、 Internal offset and size , And the client doesn't know about it . When a read request that has sunk data is distributed to a storage tier machine , The storage layer accountant calculates the corresponding machine address of each copy of the data in the cold data storage layer , Then reply it with the file link to the client . The client selects one copy to read between multiple copies according to the random strategy , Look at it this way , The cold data storage layer is more like a remote file system to the client , and inode Information and routing tables are placed in the hot data storage layer .
Next, we will analyze the design strategy of each layer in detail .
The memory layer behaves more like a cache agent , However, ordinary cache is weak in data processing efficiency . Common strategies such as write out , Before each write to the storage tier , Clean up the corresponding data in the cache first , Make sure it fails . However, data in the cache is usually multi copy , This solution can't handle network partition errors , And write out will produce many times RPC request , Excessive consumption of system resources . Another common strategy is limited data consistency , The strategy of obsolescence . When writing data to the cache , There will be a valid time , In this period of validity , The data has always been considered correct , Don't care what the real situation is . This kind of cache can only be applied to services that do not require high real-time data . For wechat's sensitive business , We need a distributed cache that can ensure strong and consistent data .
We do this by version number . We maintain a version number for each piece of data in the cache , There is also a copy of the corresponding in the storage tier . Only when the version number in the cache is consistent with the version number of the storage layer , The data in the cache is considered to be valid . therefore , Every time the client requests to read the memory layer , The cache layer will generate a read request and send it to the storage layer . In a RPC Identification of validity and update of expired data are completed in the request .
Intuitively , Strong consistent caching with this scheme does not reduce the access pressure of the storage layer . Because of the client's request to the cache layer , The request to the storage layer with the cache layer is 1：1 Of . However, the key point of this plan is , We successfully solved the memory bottleneck of the storage layer . The function of caching data in the storage layer , Transfer to the memory of the cache layer . Our requirement for the storage layer now is to cache as many version numbers as possible , Provide efficient version number access capability . In this sense , This strong consistency cache is the extension of storage layer memory . therefore , We call it the memory layer . Its advantage is that it can dynamically adjust the proportion of traffic , And it can be expanded rapidly during the peak visit period . In the following chapters, we also describe how to optimize the resource consumption caused by version number interaction through engineering means .
For the robustness of the system , Some unusual situations also need to be considered , If a memory layer machine suddenly goes offline , There will be dozens G The cache data of is invalid , Of course we don't want these tens G Data pressure , It will all fall on the disk of a storage machine .―― This will undoubtedly cause system jitter . therefore , We deployed the memory layer as a group . There are many machines in each group . A single data may have multiple copies in multiple machines . Clients access these machines in random order . In this way, we try to avoid the influence of single node failure on the whole system .
We designed a simple in memory layer 、 Lightweight cache structure supporting variable length data . Each machine contains dozens of LRU chain , Each chain is a one-dimensional array in the form of shared memory . All data is appended to the position of the array , Go to the tail and cycle from the beginning . natural , Such a structure requires an index to record the location of the data . It's a waste of memory , But it avoids the dynamic allocation of memory .
The storage layer is at the core of the whole system architecture . Both the memory layer and the hard disk layer depend on its implementation . Mentioned above , Providing efficient and lightweight version number access is the key to strong consistent memory layer implementation . meanwhile , The need to continuously sink cold data into the mechanical hard disk layer , It implies that there must be such a feature in the storage layer ： Cold data is easy to split from all data , And collected .―― That means , If the data is flat in the storage layer , Hot and cold data mixed together , So when we're pulling out the cold data , It's about traversing all the data in the hard disk , It will undoubtedly consume more system resources .
So we adopted lsm-tree Algorithm to achieve this requirement . The algorithm and B+ Trees are also a kind of indexing technology . The difference is that it's based on multiple components (C0\C1 And so on ), By delaying submission and merging sort , take B+ Random tree IO Changed to memory operation and order IO. Under our access model , All writes are hot data , It will only be submitted to C0 Components . And then at the right time , Same as C1 The data in the component is multiplexed and sorted . Through this algorithm , We can achieve the goal of data layering and data ordering at the same time .
Leveldb yes Google The company's open source repository , It is based on lsm-tree The idea of algorithm is developed . therefore , We reuse its mature data structure components , Such as log format 、 Data file format 、 Memory table format, etc . However, some of its runtime strategies , But it will bring troubles to our current network operation . For example, the runtime is unlimited compact Strategy 、 Lazy loading of data file index, etc , Will trigger uncontrollable reading , Cause service jitter ; At the same time, a large number of dynamic memory allocation will also bring some uncontrollable factors to the memory usage of the machine . therefore , We abandoned these runtime behaviors , Define your own management strategy , Make the system more controllable . meanwhile , We take advantage of different data access differences , To the cold 、 The storage of thermal data is customized , Define the granularity of block compression by time period 、 Index granularity, etc , An effective compromise CPU、 Memory 、 Disk capacity 、 disk IO And so on .
Cold data link and cold cluster routing table , It's all recorded in the storage layer and not visible to the front end . Specific design ideas , We will elaborate on in the next section .
Mechanical hard disk layer
Although the capacity of mechanical hard disk is large , however IO Poor performance , High failure rate . A common idea is that the cold data storage layer is independent of the hot data storage layer , It's directly visible to the client ―― The client holds a cold data storage layer route , And route alone ―― There is no doubt that it is simple 、 An easy to understand plan , But there are two problems in our application scenario ： It is unable to accurately air defense and aggravate the of mechanical hard disk layer IO nervous .
Definition TB The number of visitors is per TB Number of accesses to data per second . In our application scenario , Every time TB The actual access to historical data is set to N, The service capacity of the mechanical hard disk is only N Half of . If the cold data storage layer is independent , Then it needs to maintain all data indexes by itself , And the memory capacity is not enough to support tens of T Index of data , The index can only be dropped , Each time the data is read, two random reads are required . therefore , We will cold data index and routing table of cold data storage layer , It's all in the hot data storage layer , Not visible to the front end .
For disaster tolerance , We have to store multiple copies of each data . If we use the double copy scheme , Then the system needs redundancy 50% The ability to access , In case another copy fails , stay io On the premise of bottleneck , This kind of plan is not advisable . So we use the three copy scheme , Only one third of the capacity is redundant . Each copy is distributed in different parks , It can achieve disaster tolerance at the park level .
Because of the large capacity of the mechanical disk 、 Poor computing power , We use NO RAID The way the disk group is organized . In order to better realize the disaster recovery of data loss caused by single disk failure , We realize the same data of three machines in the same group at the disk level . In order to achieve disk level load balancing , We use precomputed routing 、 Hard coding , Realized ( data -> machine -> disc -> file ) Monotone mapping of , The key value of the data can be directly located to the index of the disk and the number of the file .
As a supplement to the mechanical hard disk layer , A cold data sink module is necessary , It acts as the of the cold data storage layer Writer, We have ensured the transparency of the sinking process through two-stage submission , Control the timing of process initiation to ensure that the use of resources does not affect the existing network services , By pre occupying 、 Serial write mode , Make sure that the data is completely consistent at the file level of the cold data storage layer .
Strong data consistency guarantee
Business requires the system to ensure strong consistency among multiple copies of data .―― This is an enduring challenge . We will divide the memory layer 、 Storage layer 、 The mechanical hard disk layer respectively considers the strong consistency maintenance of data .
Strong consistent caching
As described earlier , Memory layer as a strong consistency distributed cache , It's completely aligned with the storage tier , It can't judge the validity of data by itself , There is no need for interaction between multiple copies of itself . It's read-only for the front end , All write requests don't go through it , It's just a view of the data in the storage tier . So its commitment to front-end data validity depends entirely on the correctness of the storage layer .
We are based on Paxos Group Achieve the data consistency of the storage layer , By way of non tenancy , The system achieves greater availability on the premise of ensuring strong consistency .Paxos The algorithm is based on Lesile Lamport First mentioned in the paper , Its function is to determine a constant value among multiple participants .―― This is not directly related to distributed storage . We are Paxos Based on the algorithm , Design a message driven Paxos Log Components ―― Every operation log should be Paxos Algorithm to determine , Furthermore, it is based on Paxos Log Strong consistency in reading and writing .
Paxos Group Because of the ownerless model , All machines in the group are in the same position at any time .Paxos The essence of the algorithm is a multi copy synchronous write algorithm , If and only if the majority in the system accepts the same value , Will return to write successfully . So the failure of any single node , There will be no system unavailability .
The main problem with strong consistency writing protocols comes from Paxos The algorithm itself , Because make sure that the data is accepted by the majority in the system , It's a multi-stage interaction . We use the following methods , It's solved paxos Problems in algorithm Writing ： be based on fast accept Protocol optimizes write algorithm , Reduce the amount of disk writing and protocol message sending 、 Number of reception , Finally, it realizes the reduction of writing time and failure ; Based on random avoidance 、 Limit one time Paxos Write trigger Prepare Number of times etc , It's solved Paxos The livelock problem in .
Strong consistency reading protocol itself and Paxos The algorithm doesn't matter much , Just confirm the majority between multiple copies , Data that can be obtained . We get most of the machines in the cluster by broadcasting （ Include yourself ） Of paxos log The state of , Then judge the validity of the local data .
When a single node in the system fails , When the data is completely lost ―― This kind of situation can be regarded as Paxos Blind area of algorithm , Because the algorithm is based on the fact that all participants will not violate their promises , That is, data inconsistency caused by Byzantine failure .―― And this kind of situation is normal in the current network operation , therefore , We introduced Learner Only Pattern . In this mode, the fault machine only receives the submitted data , Not to participate in Paxos The process of writing the agreement , This means that we will not violate any commitment due to data loss . And then through asynchronous catch up And full data verification to quickly recover data from other replicas .
To prevent multiple nodes from failing at the same time , We distribute multiple copies of data on machines in different parks . Park is the concept of different data centers in the same city . such , Our structure is strong enough to deal with a disaster at the level of complete isolation in a single data center .
Because it's transparent to the client , Cold data sinking process as the writer of mechanical hard disk layer , Then the data consistency of this layer is easy to realize . We write through three copies serially 、 All submissions are considered successful to achieve data consistency among multiple replicas .
Supplement , Cold data clustering adds... To data blocks CRC Checksum consistency recovery queue , When stand-alone data is not available ( Lost or damaged ) when , First, the client will jump to another backup to read ( At the same time, the three machines provide reading service ), The consistency recovery queue will asynchronously recover the local data from other backup data blocks .
Because of the adoption of No Raid How to organize the disk group , And the data files of disk level between machines in the same group are consistent , In case of data loss caused by single disk failure , Just transfer data files from the same sequential disk of other machines .
Static mapping table
The main purpose of data partition is to ensure the load balance between machines on the same layer , And when the scale of the machine changes , In the end, it can still achieve load balancing .
The original intention of the classical consistent hash algorithm is to make the distributed cache robust , Addressing based on the dynamic calculation of hash value and virtual node at runtime . The difference between data storage and distributed cache is , Storage must guarantee the monotony of data mapping , Cache does not have this requirement , So classic consistent hashes usually use machines IP Wait as a parameter to hash , On the one hand, the result of this is that the data location changes from time to time , On the one hand, the load is usually unbalanced . So we modified this algorithm .
We pre calculate the random number of virtual nodes , Generated the mapping table between the cutting ring point and the entity machine . The mapping table can support up to 1000 groups of cluster size , Satisfied in any number of groups , The length of the segments between the solid machines remains different in 2% within ; And increase any number of groups ( The maximum number of groups is no more than one thousand ), After the change, the length of the cut segment between the physical machines remains different in 2% within . We hardcode this map , The calculation process is avoided at runtime , When the data is addressed according to the key hash value , After a binary search, you can get the number of the corresponding entity machine . We're in the memory layer 、 Both the storage layer and the mechanical hard disk layer use this mapping table , Ensure the consistency of data routing algorithm in each layer . In terms of engineering realization , We can reasonably use this feature to batch merge requests , To reduce resource consumption , This will be described in detail later in the chapter .
Intra group equilibrium
A group is a separate unit of data partition , Is the entity unit corresponding to the virtual node . Groups are independent of each other . Each group consists of multiple physical machines , This is a Paxos Group The basic unit in effect . Multiple copies of a piece of data are scattered on each machine in the group . In order to ensure load balancing on machines in the group , We also designed a set of algorithms , Specifies the access priority between data copies , The front end will request data one by one according to the priority , As long as you succeed in getting , That is to interrupt the process . Then we spread the copies evenly on the machines in the group according to the priority , In this way, load balancing within the group can be realized .
Static mapping tables are very flexible , Without reaching the maximum number of groups , You can add one or more groups of machines at will . Of course, the route mapping of some data has changed in this process , It involves the movement of historical data . In order not to affect the service in the process of totem , Make sure the data is still readable and writable , We developed a transparent front end , Based on the migration flags , It is safe through data double write and asynchronous data move 、 Fallback data migration process .
Minimum constant block
The storage layer and the mechanical hard disk layer are coupled together by cold data link . Because the two layers use the same mapping table , When the storage layer is migrated due to expansion , So cold data links must be re addressed , A round of repositioning . If we record cold data link and sink cold data with single key value as granularity , So in the context of trillions of keys , There is no doubt that efficiency is low . So we design the algorithm of minimum invariant block , Through a two-stage hash , Use the middle hash bucket to aggregate data , Isolate the data key value from the machine route in the cold data storage layer . Through this algorithm , We can do that ： Batch transfer of cold data 、 The hot data storage layer is in blocks in batches (block) Record cold data links for units 、 When the hot data storage layer is expanded , block (block) The data in is not broken due to the expansion , It can be migrated to the new target machine as a whole .
Bad engineering can ruin a perfect system design , therefore , How to achieve in the process of Engineering , By means of technology , Improve the performance of the system , It's also worth paying attention to .
The design of memory layer relies heavily on the efficient acquisition of version number of data in storage layer , Naturally, all version number requests are in memory . therefore , In view of this situation, we have designed a set of minimalist version numbers with fixed length 、 A lightweight 、 Effective caching ―― Memory capacity is not enough to support version number full cache .
Its data structure is just a two-dimensional array , One dimension is used to build hash chain , One dimension is used to realize LRU chain . Every time you read or write, you need to move the data in the array , To update . In this way , Let's go through tens of millions of series LRU Chain group , Achieve the cache as a whole LRU Eliminate . It has a fixed length , Can share memory to carry , Process restart is not lost 、 High memory utilization .
The batch operation
For system servers , A request from the front end , The corresponding logical operations are all serial , Naturally, we can sort out the CPU Optimization of consumption point . However, when the main bottleneck is gradually eliminated ,CPU Consumption points become scattered , The optimization effect is very small . therefore , We can only look for other breakthroughs .
We found that in the storage engine 、 In the implementation process of the consistency protocol algorithm , There are many logical operation steps , It's about network interaction , Hard disk reading and writing . therefore , We decided to merge the same steps in different requests , Realize batch operation , Greatly optimized CPU Consume .
The cost of a merger is a slight increase in time , We separate by speed , Only for the logical operations in the hot data request , Remove the time-consuming instability factor , Reduce time-consuming jitter .
Since the logical operation performance of single machine has been greatly improved , So the network interaction stage of the front and back end , Including the package and unpacking of access layer 、 Protocol processing and so on , Become the main consumption point of resources . Refer to the experience of batch operation , We also use batch technology to optimize performance ―― A single request coming from the background (Get) Aggregate into a batch request in the memory layer (Batch Get).
Because each data is routed separately according to the key value , If you want to make a request merge , We have to make sure that the data in the same batch request , All address the same Paxos Group On . therefore , We have to drop on the same storage machine in the memory layer Get The requests come together . We first used the same routing algorithm in the memory layer and the storage layer , Then align the number of groups in the memory layer with the number of groups in the storage layer , To achieve this goal .
In the design stage , We have fully investigated various schemes in the industry , As large as the overall architecture of the system , Small to specific technical points . Each scheme has its own application scenarios 、 Each has its own merits , You can't just distinguish good from bad , We are also based on our own business scenarios , Choose the right one carefully achbak.dataguru.cn/attachments/portal/201706/29/112629ve4ase4twtfpdfrd.jpg)
For example , It's also based on the idea of thermal stratification , Design a storage scheme to serve their photo business data . The difference is that it uses a combination of hardware and software , On the one hand, customize the special server （ Including hard disk 、 Power supply, etc ） And data centers , On the one hand, reduce the number of cold data backups 、 Add correction code and other means .
However, we can't apply their experience completely , Two main reasons ： The machine models we can use are fixed , There are no conditions for customizing your own hardware . At the same time, it deals with the big picture value The data of . And we are basically the text of this type of small value data . From the above TB From the perspective of traffic , They deal with data that is capacity bottleneck , And what we're dealing with is IO Bottleneck , It can be regarded as the challenge of not too cold cold data . therefore , We can only implement our own cold data management strategy .
Again , There are many solutions about how to achieve data consistency in the industry . Including our wechat self-developed Quorum agreement , It is a kind of NWR agreement , Asynchronous synchronization is adopted to realize multiple copies of data . It's asynchronous synchronization , That's when multiple copies reach final agreement , There must be a time difference , So when a single computer is offline , There is a certain probability that the data will not be available . And what we're after is a single point of failure , All data is guaranteed to be highly available .
therefore , We have adopted the decentralized Paxos Group This goal has been achieved , The non lease is PaxosStore An innovative highlight of Architecture . In case of failure, the system usually vibrates , There will be intermittent situations , Common lease practices are prone to repeatedly switch hosts in this scenario, resulting in long-term unavailability , and PaxosStore The non tenancy structure of can easily deal with , Always provide good service .PaxosStore The core code is in the process of sorting out open source , It is expected to be officially released in the fourth quarter , At the same time, the underlying framework of the project is also based on our open-source collaboration library github.com/libco.
*** This article is transferred from the wechat background team , In case of infringement , Please contact us to delete immediately ***
OpenIMgithub Open source address ：
OpenIM Official website ： https://www.rentsoft.cn
OpenIM Official forum ：https://forum.rentsoft.cn/
More technical articles ：
Open source OpenIM： High performance 、 Telescopic 、 Extensible instant messaging architecture
【OpenIM original 】 Easy to get started One article explains WebRTC Realization 1 Yes 1 Audio and video communication principle
【OpenIM original 】 Open source OpenIM： Light weight 、 Efficient 、 real time 、 reliable 、 Low cost message model