At the beginning of this chapter, the guide states that HBase is a distributed, column-oriented open-source database. If you need real-time random read/write access to large data sets, HBase is undoubtedly a good choice.

Brief introduction

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase you can build a large structured-storage cluster on inexpensive PC servers.
HBase's goal is to store and process large amounts of data; more specifically, to handle large tables composed of thousands of rows and columns using only ordinary hardware configurations.
HBase is an open-source implementation of Google's Bigtable, but with many differences. For example: Google Bigtable uses GFS as its file storage system, while HBase uses Hadoop HDFS; Google runs MapReduce to process the massive data in Bigtable, while HBase uses Hadoop MapReduce to process the massive data in HBase; and Google Bigtable uses Chubby as its coordination service, while HBase uses ZooKeeper.

HBase sits at the structured-storage layer. Hadoop HDFS provides HBase with highly reliable underlying storage, Hadoop MapReduce provides HBase with high-performance computing power, and ZooKeeper provides HBase with stable service and a failover mechanism. In addition, Pig and Hive provide high-level language support for HBase, which makes statistical processing on HBase very easy. Sqoop, in turn, provides HBase with a convenient RDBMS data-import capability, making it very convenient to migrate data from traditional databases into HBase.

In addition, HBase stores sparse data. Concretely, the data stored in HBase sits somewhere between a (key/value) mapping and relational data. Going further, the data stored in HBase can be understood as a mapping from keys to values, but it is not a simple mapping relationship; it has many other characteristics as well. Logically, the data stored in HBase looks like one large table whose data columns can be added dynamically as needed. Moreover, the data in each cell (the position determined by a row and a column) can have multiple versions, distinguished by timestamp.

Data model

1. Row Key: the row key, the primary key of the Table; records in a Table are sorted by Row Key.

2. Timestamp: the timestamp attached to each data operation; it can be seen as the data's version number.

3. Column Family: a column family. In the horizontal direction a Table consists of one or more Column Families, and one Column Family can consist of any number of Columns. That is, Column Families support dynamic extension: there is no need to predefine the number or type of Columns. All Columns are stored in binary format, so users need to perform type conversion themselves.

| Row Key       | Time Stamp | Column "contents:" | Column "anchor:"              | Column "mime:" |
|---------------|------------|--------------------|-------------------------------|----------------|
| "com.cnn.www" | t9         |                    | anchor:cnnsi.com = "CNN"      |                |
|               | t8         |                    | anchor:my.look.ca = "CNN.COM" |                |
|               | t6         | "<html>.."         |                               | text/html      |
|               | t5         | "<html>.."         |                               |                |
|               | t3         | "<html>.."         |                               |                |
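To make the model concrete, here is a minimal sketch of how the row above could be written and read with the HBase Java client. It assumes the HBase 2.x client API and a pre-created table named webtable with contents and anchor families (none of this code is from the original article):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {

            // One Put can carry several cells; the row key is the reversed URL.
            Put put = new Put(Bytes.toBytes("com.cnn.www"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>.."));
            put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"),
                          Bytes.toBytes("CNN"));
            table.put(put);

            // Read back up to 3 versions of the contents:html cell.
            Get get = new Get(Bytes.toBytes("com.cnn.www"));
            get.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            get.readVersions(3);
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
        }
    }
}
```

By default each written cell receives a server-assigned timestamp, which is what distinguishes the versions t3 through t9 in the table above.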

Physical storage

In the row direction, a Table is split into multiple HRegions. Each region is denoted by the half-open row range [startkey, endkey), and the HRegions are scattered across different RegionServers.
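As a small illustration of this layout from the client's point of view, the following sketch (assuming an existing table named webtable and the HBase 2.x RegionLocator API) prints each region's [startkey, endkey) range and the server hosting it:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class ListRegions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("webtable"))) {
            for (HRegionLocation loc : locator.getAllRegionLocations()) {
                // Each region covers the half-open row range [startkey, endkey).
                System.out.printf("[%s, %s) -> %s%n",
                    Bytes.toStringBinary(loc.getRegion().getStartKey()),
                    Bytes.toStringBinary(loc.getRegion().getEndKey()),
                    loc.getServerName());
            }
        }
    }
}
```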

Architecture

1. Client

Contains the interfaces for accessing HBase. The client maintains some caches to speed up access to HBase, such as the location information of regions.

2. ZooKeeper

Guarantees that at any time there is only one running Master in the cluster.
Stores the address entry points of all Regions.
Monitors the state of the Region Servers in real time and notifies the Master of Region Server online/offline events in real time.
Stores the HBase schema, including which tables exist and which column families each table has.
3. Master

Multiple HMasters can be started; ZooKeeper's Master Election mechanism guarantees that exactly one Master is always running.
Allocates regions to Region Servers.
Responsible for load balancing across Region Servers.
Discovers failed Region Servers and reallocates their regions.

4. Region Server

Maintains the regions assigned to it by the Master and handles the IO requests to those regions.
Responsible for splitting regions that grow too large during operation.

As can be seen, a client does not need the Master's participation to access HBase: addressing goes through ZooKeeper and the Region Servers, and data reads and writes go to the Region Servers. The Master only maintains the metadata of tables and regions, so its load is very low. The HRegionServer is mainly responsible for responding to user I/O requests and reading and writing data in the HDFS file system; it is the core module of HBase.

Internally, an HRegionServer manages a series of HRegion objects. Each HRegion corresponds to one Region of a Table, and each HRegion consists of multiple HStores. Each HStore corresponds to the storage of one Column Family in the Table. As you can see, each Column Family is in effect a centralized storage unit, so it is most efficient to place columns with common IO characteristics in a single Column Family.
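Because each Column Family becomes its own HStore, grouping columns by access pattern happens at table-creation time. A minimal sketch, assuming the HBase 2.x Admin API and reusing the webtable example from earlier (the names and the three-version setting are illustrative choices, not from the original article):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateWebtable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("webtable"))
                // Columns with common IO characteristics share a family,
                // and therefore share one HStore per region.
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("contents")).setMaxVersions(3).build())
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("anchor")).setMaxVersions(3).build())
                .build();
            admin.createTable(desc);
        }
    }
}
```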
The HStore is the core of HBase storage. It consists of two parts: a MemStore and StoreFiles. The MemStore is a sorted memory buffer; data written by a user is first put into the MemStore, and when the MemStore fills up it is flushed into a StoreFile (whose underlying implementation is an HFile). When the number of StoreFiles grows beyond a certain threshold, a Compact merge operation is triggered, merging multiple StoreFiles into one. Version merging and data deletion happen during this merge, so HBase in fact only ever appends data; all updates and deletes are carried out in later compactions. This lets a user's write operation return as soon as it enters memory, guaranteeing HBase's high I/O performance. As StoreFiles are compacted, progressively larger StoreFiles are formed, and when the size of a single StoreFile exceeds a certain threshold, a Split operation is triggered: the current Region is split into two Regions, the parent Region is taken offline, and the two newly split child Regions are assigned by the HMaster to the corresponding HRegionServers, so that the pressure on the original Region is diverted onto two Regions.

[Figure: the Compaction and Split process]
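The "certain thresholds" above are configurable. As a hedged sketch, these are the commonly tuned keys, set here programmatically on a Hadoop Configuration; the values shown are illustrative (roughly the usual defaults), and in practice these settings normally live in hbase-site.xml rather than client code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class StoreTuning {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Flush a MemStore to a StoreFile once it reaches this size (bytes).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // Minimum number of StoreFiles before a compaction is considered.
        conf.setInt("hbase.hstore.compactionThreshold", 3);
        // Split a region once a StoreFile exceeds this size (bytes).
        conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024);
    }
}
```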
Having understood the basic principles of the HStore above, it is also necessary to understand the role of the HLog. The HStore described above works fine as long as the system operates normally, but in a distributed environment there is no way to avoid system failures or downtime; once an HRegionServer exits unexpectedly, the in-memory data in its MemStores would be lost, which is why the HLog is introduced. Each HRegionServer has one HLog object. The HLog is a class that implements a Write Ahead Log: every time a user operation is written to the MemStore, a copy of the data is also written to the HLog file. The HLog file periodically rolls over to a new file and deletes old files (those whose data has already been persisted to StoreFiles). When an HRegionServer dies unexpectedly, the HMaster learns of it through ZooKeeper. The HMaster first processes the leftover HLog files, splitting their data by Region and placing the pieces under the corresponding region directories, and then redistributes the failed regions. The HRegionServers that receive these regions find, while loading them, that there is historical HLog data to process, so they replay the HLog data into the MemStore and then flush it to StoreFiles, completing data recovery.
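Since every edit is written to the HLog before the write is acknowledged, the client can explicitly trade this durability for speed on a per-mutation basis. A minimal sketch, assuming the HBase 2.x client API; the row key and values reuse the webtable example:

```java
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalTradeoff {
    public static Put unsafeFastPut() {
        Put put = new Put(Bytes.toBytes("com.cnn.www"));
        put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                      Bytes.toBytes("<html>.."));
        // SKIP_WAL: do not write this edit to the HLog. Faster, but the edit
        // is lost if the HRegionServer dies before the MemStore is flushed,
        // exactly the failure mode the HLog exists to cover.
        put.setDurability(Durability.SKIP_WAL);
        return put;
    }
}
```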

