At the beginning of this chapter, the guide proposes that HBase is a distributed, column-oriented, open-source database. If you need real-time, random read/write access to very large datasets, HBase is undoubtedly a good choice.
HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Using HBase, you can build a large structured-storage cluster on inexpensive commodity PC servers.
HBase's goal is to store and process large amounts of data; more specifically, to handle big datasets composed of many thousands of rows and columns using only ordinary hardware.
HBase is an open-source implementation of Google's Bigtable, though there are many differences. For example: Bigtable uses GFS as its file storage system, while HBase uses Hadoop HDFS; Google runs MapReduce to process the massive data in Bigtable, while HBase likewise uses Hadoop MapReduce to process the massive data in HBase; Bigtable uses Chubby as its coordination service, while HBase uses ZooKeeper for the same purpose.
HBase sits at the structured-storage layer. Hadoop HDFS provides HBase with highly reliable underlying storage; Hadoop MapReduce provides HBase with high-performance computing power; and ZooKeeper provides HBase with a stable coordination service and failover mechanism. In addition, Pig and Hive offer high-level language support for HBase, making statistical processing on HBase very easy, and Sqoop provides convenient RDBMS data import, so moving data from traditional databases into HBase is straightforward.
In addition, HBase stores sparse, loosely structured data. Concretely, the data HBase stores lies somewhere between a key/value mapping and relational data. More precisely, it can be understood as a kind of mapping from keys to values, but not a simple one, and it has many other characteristics: logically, the stored data looks like one large table whose columns can be added dynamically as needed, and each cell (the position determined by a row and a column) can hold multiple versions of the data, distinguished by timestamp.
1. Row Key: the row key, the primary key of the Table; records in a Table are sorted by Row Key.
2. Timestamp: the timestamp attached to each data operation; it can be viewed as the data's version number.
3. Column Family: a column family. Horizontally, a Table consists of one or more Column Families, and a Column Family can consist of any number of Columns. In other words, a Column Family supports dynamic expansion: there is no need to predefine the number or type of Columns. All Columns are stored in binary format, so users must perform type conversion themselves.
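The logical model described above can be sketched in a few lines of Python. This is not the real HBase API (the names `put`, `get`, and `scan` are merely illustrative); it only models the idea of a sorted map from row key to `"family:qualifier"` columns, each holding multiple timestamped versions.

```python
# A minimal sketch (not the real HBase client API) of HBase's logical data
# model: a row-key-sorted map of
#   row key -> {"family:qualifier" -> {timestamp -> value}}.
from collections import defaultdict


class SketchTable:
    def __init__(self):
        # row key -> column ("family:qualifier") -> timestamp -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        # Columns hold raw bytes; callers do their own type conversion.
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        # Return the newest version of the cell (highest timestamp).
        versions = self.rows[row_key][column]
        return versions[max(versions)] if versions else None

    def scan(self):
        # A Table is sorted by Row Key.
        for row_key in sorted(self.rows):
            yield row_key, dict(self.rows[row_key])


t = SketchTable()
t.put("row1", "info:name", b"alice", timestamp=1)
t.put("row1", "info:name", b"alicia", timestamp=2)  # second version of the same cell
print(t.get("row1", "info:name"))  # the newest version wins -> b'alicia'
```

Note how the second `put` does not overwrite the first: both versions remain in the cell, and reads simply pick the newest timestamp by default.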
A Table is split along the row direction into multiple HRegions. Each region is described by the interval [startkey, endkey), and the HRegions are scattered across different RegionServers.
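Because regions cover half-open, non-overlapping key ranges, routing a row key to its region is just a sorted lookup. The sketch below (region names are invented) shows the idea using a binary search over the region start keys:

```python
# A minimal sketch, assuming each region covers [startkey, endkey): a row key
# is routed to the last region whose start key is <= the row key.
# Region names here are made up for illustration.
import bisect

# (startkey, region name) pairs, sorted by startkey; "" marks the first region.
regions = [("", "region-1"), ("g", "region-2"), ("p", "region-3")]
start_keys = [s for s, _ in regions]


def locate_region(row_key):
    # bisect_right counts the start keys <= row_key; take the last of them.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return regions[idx][1]


print(locate_region("apple"))   # -> region-1
print(locate_region("grape"))   # -> region-2
print(locate_region("zebra"))   # -> region-3
```

This is also why a client can cache region locations: as long as the cached [startkey, endkey) interval still holds the row, the lookup needs no further coordination.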
Client:
- Contains the interfaces for accessing HBase; the client maintains some caches to speed up access, such as the location information of regions.

ZooKeeper:
- Guarantees that at any time there is only one running Master in the cluster.
- Stores the addressing entry point of all Regions.
- Monitors the state of each Region Server in real time and notifies the Master of Region Server online/offline events.
- Stores the HBase schema, including which tables exist and which column families each table has.

HMaster:
- Multiple HMasters can be started; ZooKeeper's master-election mechanism guarantees that there is always exactly one Master running.
- Assigns regions to Region Servers.
- Is responsible for load balancing across Region Servers.
- Discovers failed Region Servers and reassigns their regions.

HRegionServer:
- Maintains the regions assigned to it by the Master and handles I/O requests for those regions.
- Is responsible for splitting regions that become too large during operation.
As you can see, a client accessing HBase does not need the Master's participation: addressing goes through ZooKeeper and the Region Servers, and data reads and writes go through the Region Servers. The Master only maintains metadata about tables and regions, so its load is very low. The HRegionServer is mainly responsible for responding to user I/O requests and reading and writing data to the HDFS file system; it is the core module of HBase.
Internally, an HRegionServer manages a series of HRegion objects. Each HRegion corresponds to one Region of a Table, and an HRegion consists of multiple HStores. Each HStore corresponds to the storage of one Column Family in the Table. As you can see, each Column Family is a centralized storage unit, so it is most efficient to place columns with common I/O characteristics in the same Column Family.
HStore storage is the core of HBase storage. It consists of two parts: a MemStore and StoreFiles. The MemStore is a sorted memory buffer: data written by the user is first placed in the MemStore, and when the MemStore is full it is flushed into a StoreFile (implemented underneath as an HFile). When the number of StoreFiles grows past a certain threshold, a Compact (merge) operation is triggered, which merges multiple StoreFiles into one; version merging and data deletion take place during this merge. From this we can see that HBase in fact only ever appends data; all updates and deletions happen during a later compaction. This lets a user's write operation return as soon as it enters memory, guaranteeing high I/O performance for HBase. As StoreFiles are compacted, progressively larger StoreFiles form, and when a single StoreFile exceeds a certain size threshold, a Split operation is triggered: the current Region is split into two Regions, the parent Region is taken offline, and the two newly split child Regions are assigned by the HMaster to the appropriate HRegionServers, so that the load on the original Region is spread across two Regions. The following figure describes the Compaction and Split process:
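The write path just described can be sketched as follows. The class name and both thresholds are invented for illustration; real HBase flushes by byte size and uses far more sophisticated compaction policies, but the shape of the mechanism is the same: writes land in memory, full MemStores become immutable files, and too many files trigger a merge in which newer values win.

```python
# A minimal sketch (names and thresholds invented, not HBase's real ones) of
# the HStore write path: writes go to an in-memory buffer (MemStore); a full
# MemStore is flushed to an immutable StoreFile; too many StoreFiles trigger
# a compaction that merges them, which is where updates take effect.


class SketchHStore:
    def __init__(self, flush_threshold=3, compact_threshold=2):
        self.memstore = {}              # key -> value, flushed when full
        self.storefiles = []            # immutable dicts, oldest first
        self.flush_threshold = flush_threshold
        self.compact_threshold = compact_threshold

    def write(self, key, value):
        # A write returns as soon as the data is in memory.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The MemStore becomes a new StoreFile; files are only ever appended.
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        if len(self.storefiles) > self.compact_threshold:
            self.compact()

    def compact(self):
        # Merge all StoreFiles; later (newer) files overwrite earlier ones,
        # so updates and deletes take effect here, not in place.
        merged = {}
        for sf in self.storefiles:
            merged.update(sf)
        self.storefiles = [merged]

    def read(self, key):
        if key in self.memstore:        # newest data is checked first
            return self.memstore[key]
        for sf in reversed(self.storefiles):
            if key in sf:
                return sf[key]
        return None
```

For example, after nine writes with the defaults above, three flushes have occurred and one compaction has collapsed the three StoreFiles back into one, yet every key remains readable throughout.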
Having understood the basic principles of the HStore above, it is also necessary to understand the role of the HLog. The HStore alone works fine as long as the system operates normally, but in a distributed environment, system failures and crashes cannot be avoided: if an HRegionServer exits unexpectedly, the in-memory data in its MemStores is lost. This is why the HLog is introduced. Every HRegionServer has one HLog object. The HLog is a class implementing a Write-Ahead Log: every time a user operation writes to a MemStore, a copy of the data is also written to the HLog file (see below for the HLog file format). The HLog file periodically rolls over to a new file and deletes old files (those whose data has already been persisted to StoreFiles). When an HRegionServer dies unexpectedly, the HMaster learns of it through ZooKeeper. The HMaster first processes the leftover HLog files, splitting their log data by Region and placing each piece under the corresponding region's directory, and then redistributes the failed regions. The HRegionServers that receive these regions find, while loading them, that there is historical HLog data to process; they replay the HLog data into the MemStore and then flush it to StoreFiles, completing data recovery.
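The recovery path can be sketched as below. The class and function names are invented (they are not the HBase `HLog` classes); the sketch only shows why appending every edit to a durable log before applying it in memory makes the MemStore recoverable, and how splitting the log by region lets each new server replay just its share.

```python
# A minimal sketch (invented names, not the HBase classes) of write-ahead
# logging: every edit is appended to the log before it lands in the
# MemStore, so after a crash the lost MemStore can be rebuilt by replay.


class SketchWAL:
    def __init__(self):
        self.entries = []               # stands in for the durable HLog file

    def append(self, region, key, value):
        self.entries.append((region, key, value))

    def split_by_region(self):
        # The HMaster splits the leftover log so each region gets its part.
        per_region = {}
        for region, key, value in self.entries:
            per_region.setdefault(region, []).append((key, value))
        return per_region


def replay(log_entries):
    # Rebuild a MemStore from one region's share of the log.
    memstore = {}
    for key, value in log_entries:
        memstore[key] = value           # later entries overwrite earlier ones
    return memstore


wal = SketchWAL()
wal.append("region-a", "k1", "v1")
wal.append("region-b", "k2", "v2")
wal.append("region-a", "k1", "v1-updated")

# Simulate a crash: the in-memory MemStores are gone; only the WAL survives.
recovered = {r: replay(e) for r, e in wal.split_by_region().items()}
print(recovered["region-a"])  # -> {'k1': 'v1-updated'}
```

Replay is idempotent in the same way compaction is: applying the log in order reproduces exactly the state the MemStore held before the crash, after which a normal flush persists it to StoreFiles.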
Hadoop: The Definitive Guide (3rd edition), gleanings from Chapter 13: HBase