In the previous article, "A Brief Introduction to the Graph Database Neo4J", we showed how to use Neo4J, a very popular graph database. In this article we give a brief introduction to another kind of NoSQL database: Cassandra.

We came to Cassandra for the same reason we came to Neo4J: our product needs to record large volumes of data that relational databases cannot process quickly. Cassandra, along with MongoDB (which we will cover later), was one of the candidates during our technology selection. Although we did not choose Cassandra in the end, the internal mechanisms we studied during the selection, and the thinking behind them, are very interesting. The selection also drew on experience that the CAM (Cloud Availability Manager) group gained in actual use. So I have organized my notes into an article to share.

Technology selection

Technology selection is usually a rigorous process. A project is typically built by dozens or even hundreds of developers, so a well-judged technology choice can greatly improve the development efficiency of the whole project. When designing a solution for a class of requirements, we often have many technologies to choose from. To pick the one that best fits those requirements, we need to weigh factors such as the learning curve, development, and maintenance. These factors mainly include:

  • Whether the functionality the technology provides can fully solve the problem.
  • How extensible the technology is, and whether it lets users add custom components to meet special needs.
  • Whether the technology has rich, complete documentation and offers professional support, free or paid.
  • Whether the technology is widely used, especially in large enterprises, and has success stories.

In the process, we gradually screen the technologies available on the market and finally settle on the one that suits our needs.

For the requirement mentioned above (recording and processing large amounts of data generated automatically by the system), we had many options at the start of the selection: key-value databases such as Redis, document-based databases such as MongoDB, column-based databases such as Cassandra, and so on. A given feature can often be built on any of the databases listed above. Choosing among these three types is arguably the biggest headache for NoSQL beginners. One reason is that key-value, document-based, and column-based are only rough categories of NoSQL databases. NoSQL databases from different vendors often have slightly different implementations and offer different feature sets, so the boundaries between these types are not clear-cut.

As the name suggests, a key-value database stores data as key-value pairs, usually in a hash-table structure. To read or write a piece of data, the user only needs to supply its key, so CRUD operations are very fast. The drawback is just as obvious: data can only be accessed by key, and the database knows nothing else about the data. If we need to filter data by some pattern, a key-value database becomes very inefficient, because it usually has to scan all the data it holds.
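The trade-off can be illustrated with a minimal sketch. It uses an in-memory `HashMap` as a stand-in for a key-value store; the class `KeyValueSketch` and its methods are hypothetical names for illustration, not the API of Redis or any other product.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a key-value store; not a real client API.
class KeyValueSketch {
    private final Map<String, String> store = new HashMap<>();

    void put(String key, String value) {
        store.put(key, value);
    }

    // Access by key: a single hash lookup.
    String get(String key) {
        return store.get(key);
    }

    // Filtering by content: the store knows nothing about the values,
    // so every entry has to be visited.
    List<String> keysWhereValueContains(String fragment) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : store.entrySet()) {
            if (e.getValue().contains(fragment)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```

`get` costs one lookup regardless of how much data is stored, while `keysWhereValueContains` grows linearly with the number of entries.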

For this reason, a key-value database is often used in a service as a server-side cache that records the results of complex, time-consuming computations. The best known is Redis. MemcacheDB, which adds persistence to Memcached, is also a key-value database.

The main difference between a document-based database and a key-value database is that the stored values are no longer opaque strings but documents in a specific format, such as XML or JSON. These documents can hold key-value pairs, arrays, and even embedded documents. For example:

 {
   Name: "Jefferson",
   Children: [{
     Name: "Hillary",
     Age: 14
   }, {
     Name: "Todd",
     Age: 12
   }],
   Age: 45,
   Address: {
     number: 1234,
     street: "Fake road",
     City: "Fake City",
     state: "NY",
     Country: "USA"
   }
 }

Some readers may ask: can't we also store JSON or XML data in a key-value database? The answer is that document-based databases usually support indexes. As mentioned above, key-value databases are very inefficient at searching and filtering data. With indexes, a document-based database supports these operations well. Some document-based databases even allow operations similar to a relational JOIN. And compared with relational databases, a document-based database retains the flexibility of a key-value database.

Column-based databases are very different from the previous two types. Data in a relational database is usually organized by rows. Each row contains several columns with different meanings, and the rows are written sequentially to the persistence file. A common operation in a relational database is to filter for rows with particular characteristics, usually with a WHERE clause:

 SELECT * FROM customers WHERE country='Mexico';

In a traditional relational database, the table this statement operates on might have three columns: id, name, and country.

In the database file for this table, the values of each row are recorded one after another, row by row.

So when the SQL statement above runs, the relational database cannot read the data it needs contiguously from the file: the country values it is interested in are interleaved with the other fields.

This greatly reduces the performance of the relational database: to run this SQL statement, it also has to read the id and name fields. That significantly increases the amount of data read, and it requires a series of offset calculations to reach the required values. And the example above is only the simplest kind of table. If the table had dozens of columns, the amount of data read would grow by dozens of times, and the offset calculations would become more complex.

So how do we solve this problem? The answer is to store the data of a column together.

This is the core idea of a column-based database: record data in the data file by column to get better query and traversal efficiency. Two points deserve attention. First, a column-based database does not mean that all data is organized by column, nor is that necessary; only the data that needs to be queried this way has to be stored by column. Second, Cassandra's support for queries is tied to the data model it uses. In other words, its query support is limited. We will cover this limitation in later sections.
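To make the difference between the two layouts concrete, here is a small sketch that models the data file as a flat array. The class `LayoutSketch`, its method names, and the sample values in the usage note are all assumptions made for illustration; real storage engines are far more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Toy layouts for a table with the columns (id, name, country).
class LayoutSketch {
    // Row-oriented file: the values of one row sit next to each other,
    // e.g. {"1", "Ana", "Mexico", "2", "Bob", "USA", ...}.
    static List<Integer> scanRowStore(String[] file, int columnsPerRow,
                                      int columnOffset, String wanted) {
        List<Integer> rows = new ArrayList<>();
        for (int row = 0; row * columnsPerRow < file.length; row++) {
            // Offset arithmetic is needed to step over the other fields.
            if (file[row * columnsPerRow + columnOffset].equals(wanted)) {
                rows.add(row);
            }
        }
        return rows;
    }

    // Column-oriented file: all values of the country column are contiguous.
    static List<Integer> scanColumnStore(String[] countryColumn, String wanted) {
        List<Integer> rows = new ArrayList<>();
        for (int row = 0; row < countryColumn.length; row++) {
            if (countryColumn[row].equals(wanted)) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

Both methods return the same row numbers, but the column-oriented scan touches only the values it actually compares, while the row-oriented scan has to jump across every id and name along the way.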

By now, you should be able to choose a suitable NoSQL database for your needs based on the characteristics of each type.

A first look at Cassandra

OK. Having briefly introduced the three types of NoSQL database (key-value, document-based, and column-based), we can start trying out Cassandra. Since I have often run into NoSQL databases whose new versions break API backward compatibility, I use the example from the Datastax Java Driver directly here. That way readers can also find sample code for the latest client version on the driver's page.

The simplest Java code for reading a record looks like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

Cluster cluster = null;
try {
    // Create the client that connects to Cassandra
    cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .build();
    // Create a user session
    Session session = cluster.connect();
    // Execute a CQL statement
    ResultSet rs = session.execute("select release_version from system.local");
    // Take the first row from the returned results
    Row row = rs.one();
    System.out.println(row.getString("release_version"));
} finally {
    // Call close() on the cluster to close it and all connections associated with it
    if (cluster != null) {
        cluster.close();
    }
}

Looks simple, doesn't it? Indeed, with the help of the client, operating Cassandra is not very hard. What readers need to think about most is how to design the model for the data stored in Cassandra. Unlike the relational modeling approach most of us are familiar with, a Cassandra data model needs to be join-less. In short, because the data is distributed across different nodes of the Cassandra cluster, JOIN operations on it cannot be executed efficiently.

So how do we define a model for this data? First we need to understand the basic data model elements Cassandra supports: Column, Super Column, Column Family, and Keyspace. Let us introduce them briefly.

Column is the most basic data model element in Cassandra. It holds a small set of key-value pairs (a name, a value, and a timestamp):

 {
   "name": "Author Name",
   "value": "Sam",
   "timestamp": 123456789
 }

A Super Column contains a series of Columns. The value of a Super Column is itself a collection of Columns:

 {
   "name": "Cassandra Introduction",
   "value": {
     "author": { "name": "Author Name", "value": "Sam", "timestamp": 123456789 },
     "publisher": { "name": "Publisher", "value": "China Press", "timestamp": 234567890 }
   }
 }

Note that the Cassandra documentation no longer recommends heavy use of Super Columns, although it does not state the reason directly. It is said that this is because accessing data in a Super Column often involves deserialization. One commonly cited piece of evidence is that developers online often report adding too much data to a Super Column and then seeing requests involving it run slowly. Of course, this is only a guess. But since the official documentation is cautious about Super Columns, we should avoid them in everyday use.

A Column Family is a collection of Columns. In this collection, each Column has a key associated with it:

 Authors = {
   "1332": {
     "name": "Author Name",
     "value": "Sam",
     "timestamp": 123456789
   },
   "1452": {
     "name": "Author Name",
     "value": "Lucy",
     "timestamp": 12343437
   }
 }

The Column Family example above contains a series of Columns. A Column Family can also contain a series of Super Columns (use with caution).

Finally, a Keyspace is a collection of Column Families.

Did you notice? None of these elements offers a way for one Column (or Super Column) to reference another Column (or Super Column); the only option is to have a Super Column contain other Columns to carry that information. This is very different from relational database design, where we use foreign keys to associate a record with other records. Remember the name of the method that builds data associations through foreign keys? Right: normalization. Through the associations expressed by foreign keys, it effectively eliminates redundant data in a relational database. In Cassandra, the approach we use is denormalization, that is, allowing an acceptable degree of data redundancy. In other words, the associated data is recorded directly inside the current data type.

When using Cassandra, which data should be folded into another Cassandra data model, and which should have an independent abstraction of its own? It all depends on the read and write requests our application performs most often. Think about why we use Cassandra, or what Cassandra's advantage over relational databases is: fast execution of read or write requests over massive data. If we abstract the data model only from the things we operate on and ignore how efficiently Cassandra executes on top of those models, or if the models cannot even support the corresponding business logic, then using Cassandra has no practical value. A better approach is this: first define the abstract concepts according to the application's requirements, then design the requests that will run against those concepts along with the application's business logic. Software developers can then use those requests to decide how to model the abstractions.

When designing the model, we face another problem: how to specify the various keys each Column Family uses. The Cassandra documentation uses a series of key-related terms: Partition Key, Clustering Key, Primary Key, and Composite Key. What do they mean?

Primary Key is actually a very general concept. In Cassandra, it denotes the one or more columns used to retrieve data from Cassandra:

 create table sample (
key text PRIMARY KEY,
data text
);

In the example above, we designate the key column as the PRIMARY KEY of sample. When needed, a Primary Key can also consist of multiple columns:

 create table sample (
   key_one text,
   key_two text,
   data text,
   PRIMARY KEY(key_one, key_two)
 );

In the example above, the Primary Key we created is a Composite Key made up of two columns, key_one and key_two. The first component of the Composite Key is called the Partition Key, and the components after it are called Clustering Keys. The Partition Key decides which node in the Cassandra cluster records the data; each Partition Key value corresponds to a specific partition. The Clustering Key is used to sort data within a partition. If a Primary Key contains only one column, it has a Partition Key but no Clustering Key.

The Partition Key and the Clustering Key can each consist of multiple columns:

 create table sample (
   key_primary_one text,
   key_primary_two text,
   key_cluster_one text,
   key_cluster_two text,
   data text,
   PRIMARY KEY((key_primary_one, key_primary_two), key_cluster_one, key_cluster_two)
 );

In a CQL statement, however, the conditions in the WHERE clause can by default only reference columns that are part of the Primary Key (or columns that have an index). You need to decide, based on how your data is distributed, which columns should form the Partition Key and which should form the Clustering Key used to sort the data.

A good Partition Key design often greatly improves application performance. First, because the Partition Key controls which node records a piece of data, it determines whether data is distributed evenly across the nodes of the Cassandra cluster so that all nodes are fully used. At the same time, with a good Partition Key, read requests should touch as few nodes as possible. When executing a read request, Cassandra has to coordinate the result sets obtained from each node involved, so the fewer nodes take part in answering a read, the higher the performance. Choosing the Partition Key is therefore a key step in model design. A field that is evenly distributed and frequently used as an input condition in requests is often a good Partition Key candidate.

We should also consider how to set the model's Clustering Key. Because the Clustering Key sorts data within a partition, it gives good support to requests that involve range filtering.
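The roles of the two kinds of key can be pictured with a small in-memory sketch: a map from Partition Key to a sorted map keyed by Clustering Key. This is only an analogy under that assumption; the class `PartitionSketch` and its methods are hypothetical and say nothing about how Cassandra actually lays out data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class PartitionSketch {
    // partition key -> (clustering key -> data); clustering keys are kept sorted
    private final Map<String, TreeMap<String, String>> partitions = new HashMap<>();

    void insert(String partitionKey, String clusteringKey, String data) {
        partitions.computeIfAbsent(partitionKey, k -> new TreeMap<>())
                  .put(clusteringKey, data);
    }

    // Range filter on the clustering key inside ONE partition: [from, to).
    // Because rows are sorted, the matching rows form one contiguous slice.
    SortedMap<String, String> range(String partitionKey, String from, String to) {
        TreeMap<String, String> rows = partitions.get(partitionKey);
        if (rows == null) {
            return new TreeMap<>();
        }
        return rows.subMap(from, to);
    }
}
```

Locating the partition is a single lookup (in a cluster, it selects the node), and the range filter then reads one sorted slice instead of scanning every row.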

Cassandra internals

In this section we explain a series of Cassandra's internal mechanisms. Many of them are solutions commonly used across the industry. So once you have learned how Cassandra uses them, you can easily understand how other libraries use these mechanisms, and even use them in your own projects.

These common internal mechanisms include the Log-Structured Merge-Tree, Consistent Hash, Virtual Node, and others.

Log-Structured Merge-Tree

The most interesting data structure is probably the Log-Structured Merge-Tree. Cassandra uses a similar structure internally to improve the efficiency of its service instances. So how does it work?

In short, a Log-Structured Merge-Tree consists mainly of two tree-structured components: C0, which lives in memory, and C1, which lives mainly on disk.

When a new node is added, the Log-Structured Merge-Tree first appends a record of the insertion to a log file and then inserts the node into the C0 tree. The record in the log file exists for data recovery: the C0 tree lives in memory and is easily lost to system crashes and similar failures. When reading data, the Log-Structured Merge-Tree first looks in the C0 tree and then in the C1 tree.

Once the C0 tree meets certain conditions, for example when it uses too much memory, the data it contains is migrated to C1. In the Log-Structured Merge-Tree this operation is called a rolling merge. It merges a series of records from the C0 tree into the C1 tree, and the merged result is written to new, contiguous disk space.

(Figure: nearly identical to the original figure in the paper.)

Looked at as a single tree, C1 resembles the B-tree or B+ tree we already know, doesn't it?

You may have noticed that the description above stresses one word: contiguous. This is because the nodes on the same level of the C1 tree are recorded contiguously on disk. The disk can then be read sequentially, avoiding excessive seeks and greatly improving efficiency.
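The write and read paths described in this section can be sketched in a few lines. This is a deliberately simplified toy: the log and C1 are plain in-memory collections standing in for on-disk structures, the merge moves all of C0 at once rather than rolling through it, and the class `LsmSketch` and the `c0Limit` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class LsmSketch {
    final List<String> log = new ArrayList<>();          // stand-in for the log file
    final TreeMap<String, String> c0 = new TreeMap<>();  // in-memory component
    final TreeMap<String, String> c1 = new TreeMap<>();  // stand-in for the on-disk component
    private final int c0Limit;

    LsmSketch(int c0Limit) {
        this.c0Limit = c0Limit;
    }

    void put(String key, String value) {
        log.add(key + "=" + value);  // 1. record the insertion in the log first
        c0.put(key, value);          // 2. then insert into C0
        if (c0.size() > c0Limit) {
            merge();
        }
    }

    String get(String key) {
        String value = c0.get(key);                  // look in C0 first
        return value != null ? value : c1.get(key);  // then in C1
    }

    // Simplified merge: moves all of C0 into C1 in one step.
    void merge() {
        c1.putAll(c0);
        c0.clear();
        log.clear();  // merged entries no longer depend on the log for recovery
    }
}
```

Because both components are sorted by key, a real merge can stream through C0 and C1 in key order and write its output sequentially, which is where the contiguous disk layout comes from.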

Memtable and SSTable

We just mentioned that Cassandra internally uses a data structure similar to the Log-Structured Merge-Tree. In this section we explain Cassandra's main data structures and operation flows. If you roughly understood the explanation of the Log-Structured Merge-Tree in the previous section, these data structures will be very easy to understand.

Cassandra has three very important data structures: the Memtable, kept in memory, and the Commit Log and SSTable, kept on disk. The Memtable records recent changes in memory, while SSTables on disk hold most of the data Cassandra stores. An SSTable records a series of key-value pairs sorted by key. Usually, one Cassandra table corresponds to one Memtable and multiple SSTables. In addition, to speed up data search and access, Cassandra allows software developers to create indexes on specific columns.

Since data may be in the Memtable or may already have been persisted to an SSTable, Cassandra has to merge the data obtained from the Memtable and the SSTables when reading. To improve speed and avoid unnecessary SSTable access, Cassandra provides a component called a Bloom Filter: every SSTable has a Bloom Filter that determines whether that SSTable may contain the data requested by the current query. If it may, Cassandra attempts to retrieve the data from that SSTable; if not, Cassandra skips the SSTable, avoiding unnecessary disk access.
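For readers who have not met Bloom filters before, a minimal sketch follows. It is a generic textbook-style filter built on `java.util.BitSet`, not Cassandra's implementation; the class name, the bit-array size, and the way the hash positions are derived are all assumptions chosen for brevity.

```java
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;  // an odd second hash is enough for a sketch
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added.
    // true: the key was possibly added (false positives can happen).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The important property is the asymmetry: a negative answer is always correct, so an SSTable whose filter answers "no" can be skipped safely, while a positive answer only means the SSTable is worth reading.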

After the Bloom Filter indicates that an SSTable may contain the requested data, Cassandra starts trying to fetch the data from that SSTable. First, Cassandra checks whether the Partition Key Cache holds the Index Entry of the required data. If it does, Cassandra looks up the address of the data directly in the Compression Offset Map and retrieves the data from that address. If the Partition Key Cache does not hold the Index Entry, Cassandra first finds the approximate location of the Index Entry in the Partition Summary and then searches the Partition Index from that location to find the Index Entry. Once the Index Entry is found, Cassandra looks up the corresponding entry in the Compression Offset Map and obtains the required data using the offset recorded in that entry.

(Figure: slightly adapted from the diagram in the documentation.)

Did you notice? The data recorded in an SSTable is still stored as sequentially recorded fields. The difference is that a lookup first goes through components such as the Partition Key Cache and the Compression Offset Map. These components contain only a series of mappings, which are in effect a contiguous record of what a request needs, and that speeds up data lookup.

Cassandra's write path is very similar to that of the Log-Structured Merge-Tree: the log in the Log-Structured Merge-Tree corresponds to the Commit Log, the C0 tree corresponds to the Memtable, and the C1 tree corresponds to the set of SSTables. On a write, Cassandra writes the data to the Memtable and appends a corresponding record to the end of the Commit Log. If an abnormal condition such as a power failure occurs, Cassandra can still restore the Memtable data from the Commit Log.

As data keeps being written, the Memtable grows. When its size reaches a threshold, Cassandra's data migration process is triggered. This process writes the Memtable data to disk as a new SSTable and removes the corresponding write records from the Commit Log.

This raises a question that may puzzle readers: if new data is always written out as new SSTable data, how is existing data updated? The answer: when data needs to be updated, Cassandra writes a new record carrying the current timestamp, which marks it as the latest version. The original record in the older SSTable is thereby superseded.

This leads to a problem: heavy updating makes the disk space used by SSTables grow quickly, while much of the recorded data is already obsolete. After a while, disk utilization drops considerably. At that point we need to compact the SSTables to release the space occupied by the obsolete data.

Now another question arises: we can tell which copy of duplicated data is the latest by its timestamp, but how is data deletion handled? In Cassandra, deletion is done through a marker called a tombstone. If a piece of data carries a tombstone, it is treated as deleted during the next compaction and is not added to the compacted SSTable.

During compaction, the original SSTables and the new SSTable exist on disk at the same time. The original SSTables continue to serve reads. Once the new SSTable has been created, the old SSTables are deleted.
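The rules described above (the newest timestamp wins, and tombstoned data is dropped) can be sketched as a merge over several in-memory maps standing in for SSTables. The class `CompactionSketch`, the `Cell` type, and the use of a `null` value as the tombstone marker are assumptions made for this illustration; real compaction also has to respect details such as how long a tombstone must be retained.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompactionSketch {
    // One stored version of a value; value == null plays the role of a tombstone.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    // Merge several "SSTables" into one new, sorted table.
    static TreeMap<String, Cell> compact(List<Map<String, Cell>> sstables) {
        TreeMap<String, Cell> latest = new TreeMap<>();
        for (Map<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                Cell seen = latest.get(e.getKey());
                // Keep only the version with the newest timestamp.
                if (seen == null || e.getValue().timestamp > seen.timestamp) {
                    latest.put(e.getKey(), e.getValue());
                }
            }
        }
        // Data whose newest version is a tombstone is not carried over.
        latest.values().removeIf(Cell::isTombstone);
        return latest;
    }
}
```

The inputs are never modified, which mirrors the point above that the old SSTables keep serving reads until the new one is complete.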

Here are a few issues to watch for in day-to-day use of Cassandra. First, because rebuilding the Memtable from the Commit Log is time-consuming, we should manually trigger the merge logic to persist the Memtable data to SSTables before performing any operation that requires rebuilding the Memtable. The most common such operation is restarting a Cassandra node.

Another thing to watch: do not overuse indexes. Although an index can greatly speed up reads, it has to be maintained on every write, which costs some performance. In this respect Cassandra is not much different from a traditional relational database.

Cassandra cluster

Of course, running Cassandra as a single database instance is not a good choice. A single server is a single point of failure for the service, and we cannot take full advantage of Cassandra's scale-out capability. So starting with this section, we briefly explain Cassandra clusters and the mechanisms used within them.

A Cassandra cluster usually involves the following concepts: Node, Data Center, and Cluster. A node is the most basic structure for storing data in a Cassandra cluster; a data center is a set of nodes located in the same geographical area; a cluster often consists of multiple data centers in different regions.

Consider a Cassandra cluster made up of three data centers. Two of them are in the same region and the third is in another region. Two data centers in the same region are admittedly rare, but the official Cassandra documentation does not rule it out. Each data center contains a series of nodes that store the data hosted by the Cassandra cluster.

With a cluster in place, we need a series of mechanisms for the nodes in the cluster to cooperate, and we have to consider the non-functional requirements of a cluster: maintaining node state, distributing data, scalability, high availability, disaster recovery, and so on.

Detecting node state is the first step toward high availability and the basis for distributing data among nodes. Cassandra uses a peer-to-peer communication scheme called Gossip to share and propagate each node's state among the nodes of the cluster. Only then can Cassandra know which nodes can actually store data, and distribute data operations to those nodes.

When saving data, Cassandra uses a component called the Partitioner to decide which nodes the data is distributed to. Another component related to data storage is the Snitch. It lets Cassandra route reads and writes based on the topology and performance of the nodes in the cluster.

These components also use methods common in the industry. For example, Cassandra uses VNodes internally to handle differences in hardware performance, which at the level of physical hardware amounts to a solution similar to the Weighted Round Robin mentioned in the article "Introduction to Enterprise Load Balancing". Another example is its internal use of Consistent Hash, which we also introduced in "A Brief Introduction to Memcached".

That completes the brief overview. In the following sections, we introduce these Cassandra mechanisms.

Gossip

First, Gossip. It is the protocol used to propagate node state among the nodes of a Cassandra cluster. It runs once per second and exchanges the state of the current node, along with the state of the other nodes it knows about, with up to three other nodes. In this way, the live nodes of a Cassandra cluster quickly learn the state of the other nodes in the cluster. The state information also carries a timestamp, which lets Gossip determine which state is newer.
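The exchange step can be sketched as two nodes merging their views and keeping whichever entry has the newer timestamp. This toy leaves out everything about networking, peer selection, and the once-per-second schedule; the class `GossipSketch`, the `NodeState` type, and the status strings are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class GossipSketch {
    // What one node believes about some node in the cluster.
    static final class NodeState {
        final String status;   // for example "UP" or "DOWN"
        final long timestamp;  // when this state was produced

        NodeState(String status, long timestamp) {
            this.status = status;
            this.timestamp = timestamp;
        }
    }

    final Map<String, NodeState> known = new HashMap<>();

    void update(String node, String status, long timestamp) {
        known.put(node, new NodeState(status, timestamp));
    }

    // Take over every entry from the peer that is newer than our own.
    void mergeFrom(GossipSketch peer) {
        for (Map.Entry<String, NodeState> e : peer.known.entrySet()) {
            known.merge(e.getKey(), e.getValue(),
                    (mine, theirs) -> theirs.timestamp > mine.timestamp ? theirs : mine);
        }
    }

    // One gossip round between two nodes: afterwards both hold the same view.
    static void exchange(GossipSketch a, GossipSketch b) {
        a.mergeFrom(b);
        b.mergeFrom(a);
    }
}
```

Since every round spreads a node's whole view to its peers, a change observed by one node reaches the rest of the cluster after a small number of rounds.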

Besides exchanging node state, Gossip also has to cope with operations on the cluster itself: adding nodes, removing nodes, nodes rejoining, and so on. To handle these cases, Gossip introduces the concept of a Seed Node. A Seed Node gives every newly joining node an entry point for starting its Gossip exchange. After joining the Cassandra cluster, a new node first tries to exchange state with the Seed Nodes it has on record. This lets it obtain information about the other nodes in the cluster so that it can communicate with them, and it also lets the new node's own information spread through those Seed Nodes. Because the node state information a node obtains is usually persisted, for example to disk, a node can still communicate using that persisted information after a restart and rejoin the Gossip exchange. When a node fails, the other nodes periodically send probe messages to it to try to restore the connection. This, however, causes trouble when we want to remove a node permanently: the other Cassandra nodes keep assuming it will rejoin the cluster at some point and keep sending it probe messages. In that case we need to use the node tool (nodetool) that Cassandra provides.

How does Gossip decide that a node has failed? If, during an exchange, the other side does not answer for a long time, the current node marks the target node as failed and propagates that state through the Gossip protocol. Because the topology of a Cassandra cluster can be very complex, for example spanning regions, the criterion for failure is not simply a fixed timeout. A fixed timeout has a serious problem: state exchange between two nodes in the same lab is very fast, while exchange across regions is slower. If we set a short timeout, cross-region exchanges are often falsely reported as failures; if we set a long one, Gossip becomes less sensitive at detecting node failure. To avoid this, Gossip uses decision logic based on the nodes' past exchange history and other factors. For two distant nodes it uses a larger time window, so no false alarms are raised; for two nearby nodes it uses a smaller time window, which improves detection sensitivity.
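The idea of a time window that adapts to past exchange history can be sketched as follows. This is a deliberately crude stand-in and not Cassandra's actual failure detection algorithm: it simply suspects a peer when the current silence exceeds a multiple of the average interval observed so far. The class `AdaptiveDetectorSketch` and the `window` and `factor` parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class AdaptiveDetectorSketch {
    private final Deque<Long> intervals = new ArrayDeque<>();  // recent gaps between messages
    private final int window;     // how many recent gaps to remember
    private final double factor;  // how many "average gaps" of silence we tolerate
    private long lastHeartbeat = -1;

    AdaptiveDetectorSketch(int window, double factor) {
        this.window = window;
        this.factor = factor;
    }

    // Record that a message from the peer arrived at the given time.
    void heartbeat(long nowMillis) {
        if (lastHeartbeat >= 0) {
            intervals.addLast(nowMillis - lastHeartbeat);
            if (intervals.size() > window) {
                intervals.removeFirst();
            }
        }
        lastHeartbeat = nowMillis;
    }

    // Suspect the peer when its silence is long relative to its own history.
    boolean isSuspected(long nowMillis) {
        if (intervals.isEmpty()) {
            return false;  // not enough history yet
        }
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(0);
        return (nowMillis - lastHeartbeat) > factor * mean;
    }
}
```

With a factor of 3, a peer that normally answers every 100 ms is suspected after about 300 ms of silence, while a cross-region peer that normally answers every 1000 ms gets roughly 3000 ms before it is suspected.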

Consistent Hash

Next we discuss Consistent Hash. Ordinary hashing schemes usually involve the concept of buckets. Each hash computation determines which bucket a particular piece of data should be stored in. If the number of buckets changes, the results of previous hash computations become invalid. Consistent Hash solves this problem well.

How does Consistent Hash work? Consider a circle with points distributed on it representing the integers 0 to 1023. These integers are spread evenly around the circle:

In the diagram above, we highlight six blue dots that divide the circle into six equal parts. They represent the six nodes used to record data. Each of these six nodes is responsible for one range. For example, the node at blue dot 512 records data whose hash values fall in the range 512 to 681. In Cassandra and some other fields, this circle is called a Ring. Next, we hash the data to be stored and obtain its hash value. For example, if the hash value of a piece of data is 900, it falls between 853 and 1024:

So this data is recorded by the node at blue dot 853. If other nodes fail, the node holding this data does not change:

How is the hash value of each piece of data computed? By the Partitioner. Its input is the data's Partition Key, and the position of its result on the Ring determines which node is responsible for storing the data.
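The ring lookup can be sketched with a `TreeMap` from ring position to node. The sketch follows the example above, where a node owns the range that starts at its own position; the class `RingSketch`, the node names, the positions used in the usage note, and the `hashCode`-based stand-in for the Partitioner are assumptions for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

class RingSketch {
    static final int RING_SIZE = 1024;
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int position, String node) {
        ring.put(position, node);
    }

    void removeNode(int position) {
        ring.remove(position);
    }

    // Stand-in for the Partitioner: maps a partition key onto the ring.
    static int hash(String partitionKey) {
        return Math.floorMod(partitionKey.hashCode(), RING_SIZE);
    }

    // The owner is the node at the nearest position at or below the hash,
    // wrapping around to the highest position if there is none.
    String ownerOf(int hash) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Integer, String> entry = ring.floorEntry(hash);
        if (entry == null) {
            entry = ring.lastEntry();
        }
        return entry.getValue();
    }
}
```

With six nodes placed at positions 0, 170, 341, 512, 682, and 853, a hash of 900 is owned by the node at 853. Removing the node at 512 changes the owner only for hashes in the range 512 to 681; the data with hash 900 stays where it was.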

Virtual Node

We have introduced how Consistent Hash works. But one question remains: what happens to the data on a failed node? Can we no longer access it? That depends on the data replication settings of the Cassandra cluster. Usually we enable this feature so that several nodes each hold a copy of the data. If one of them fails, the data can still be read from the others.

One issue to handle here is that physical nodes differ in capacity. Simply put, if a node can provide much less service capacity than the others but is given the same load, it will be overloaded. To handle this, Cassandra provides a solution called VNode. In this solution, each physical node is divided, according to its actual capacity, into a number of VNodes of equal capacity. Each VNode is responsible for one segment of data on the Ring. For example, for the six-node Ring above, the relationship between VNodes and physical machines might look like this:
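A rough sketch of the idea: give each physical machine a number of equally sized ring segments in proportion to its capacity. The class `VNodeSketch`, the capacity units, and the simple sequential placement are assumptions made for illustration; how a real cluster chooses VNode positions is not described in this article and is not modeled here.

```java
import java.util.Map;
import java.util.TreeMap;

class VNodeSketch {
    // capacityByMachine: machine name -> number of VNodes it should carry.
    // Returns ring position -> machine, with one entry per VNode.
    static TreeMap<Integer, String> assign(Map<String, Integer> capacityByMachine, int ringSize) {
        int totalVNodes = 0;
        for (int capacity : capacityByMachine.values()) {
            totalVNodes += capacity;
        }
        // Sort machine names so the result does not depend on map iteration order.
        Map<String, Integer> sorted = new TreeMap<>(capacityByMachine);
        TreeMap<Integer, String> ring = new TreeMap<>();
        int index = 0;
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                // Every VNode covers an equally sized segment of the ring.
                ring.put(index * ringSize / totalVNodes, e.getKey());
                index++;
            }
        }
        return ring;
    }
}
```

A machine with capacity 4 ends up owning twice as many segments, and therefore roughly twice as much data, as a machine with capacity 2, even though every individual VNode is the same size.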

When using VNode, one thing we often need to pay attention to is the Replication Factor setting. In meaning, the Replication Factor in Cassandra is no different from the Replication Factor in other common databases: its value indicates how many copies of the data are kept in Cassandra. For example, when it is set to 1, Cassandra keeps only one copy of the data. If it is set to 2, Cassandra keeps one additional copy.

When deciding which Replication Factor a Cassandra cluster should use, we need to consider several factors:

  • Number of physical machines. Imagine that we set the Replication Factor higher than the number of physical machines: some physical machine must then keep two copies of the same data. That does not really help, because once that machine fails, more than one copy of the data is lost at the same time. So in terms of high availability, once the Replication Factor exceeds the number of physical machines, the extra copies add little.
  • Heterogeneity of physical machines. The heterogeneity of physical machines often affects what a given Replication Factor actually achieves. Take an extreme example: suppose a Cassandra cluster consists of five physical machines, and one of them has 4 times the capacity of each of the others. Setting the Replication Factor to 3 would then cause the large-capacity machine to store more than one copy of the same data, which is not much better than setting it to 2.

So when deciding on the Replication Factor of a Cassandra cluster, we should choose an appropriate value according to the number and capacity of the physical machines in the cluster. Otherwise, it will only produce more useless copies of the data.
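The second case above can be made concrete with a small toy model. It assumes that copies are placed on consecutive VNodes going clockwise around the ring; this placement rule, the function `replica_machines` and the machine names are assumptions made purely to illustrate the argument, and Cassandra's real replication strategies may place replicas differently.

```python
import bisect


def replica_machines(ring, hash_value, replication_factor):
    """Return the machines holding each copy of the data with this hash value.

    ring is a sorted list of (token, machine) pairs, one pair per VNode.
    In this toy model the copies go to consecutive VNodes, clockwise.
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_right(tokens, hash_value) - 1
    return [ring[(start + i) % len(ring)][1]
            for i in range(replication_factor)]


# Five machines; "big" has 4 times the capacity and therefore 4 of the 8 VNodes.
ring = [(0, "big"), (128, "m1"), (256, "big"), (384, "m2"),
        (512, "big"), (640, "m3"), (768, "big"), (896, "m4")]

copies_rf3 = replica_machines(ring, 10, 3)
copies_rf2 = replica_machines(ring, 10, 2)
print(copies_rf3, len(set(copies_rf3)))  # three copies, but only two distinct machines
print(copies_rf2, len(set(copies_rf2)))  # two copies on two distinct machines
```

In this model, the third copy of the data with hash value 10 lands on a machine that already holds one, so a Replication Factor of 3 protects against no more machine failures than a Replication Factor of 2.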

Note: This article was written in August 2015. NoSQL databases evolve very quickly and often introduce changes that break backward compatibility (for example, Spring Data Neo4J no longer supports @Fetch). So if you find that any of the descriptions here are out of date, please leave a comment for the benefit of other readers. Thank you very much.

When reprinting, please cite the original address: http://www.cnblogs.com/loveis715/p/5299495.html

For commercial reprints, please contact me in advance: silverfox715@sina.com

When reprinting on an official account, please make sure the article is not labeled as original, because sorting that out afterwards takes too much coordination...
