Hive

Preface

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides full SQL query capability, and converts SQL-like statements into MapReduce tasks for execution.

Data organization format

Here is how Hive organizes the data it stores directly on HDFS:

  • Table: each table is stored under its own directory on HDFS
  • Partition (optional): each partition is stored in a subdirectory of the table's directory
  • Bucket (optional): within a partition, rows are hashed into different buckets; each bucket is a single file

Users can specify both the partitioning scheme and the bucketing scheme, so that some partitions can be skipped entirely during execution. Hive appears to apply the specified partitioning first and then call a hash function within each partition; GreenPlum does the opposite, hashing first and then allowing a different partitioning scheme inside each hash shard.
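As a minimal sketch of both mechanisms (the table and column names here are hypothetical), a Hive table can be partitioned by date and bucketed by user id like this:

```sql
-- Hypothetical table: one HDFS subdirectory per dt partition;
-- within each partition, rows are hashed on user_id into 32 bucket files.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- A predicate on the partition column lets Hive skip every
-- directory except dt='2024-01-01' instead of scanning the table.
SELECT count(*) FROM page_views WHERE dt = '2024-01-01';
```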

Hive architecture

As the architecture diagram above shows, HDFS and MapReduce are the foundation of Hive's architecture.

  • MetaStore: stores and manages Hive's metadata, using a relational database to hold the metadata.
  • Interpreter and compiler: parse the SQL statement into a syntax tree, then generate a DAG that becomes the logical plan.
  • Optimizer: provides only rule-based optimization
    • Column pruning: read only the columns in the projection
    • Row filtering: skip partitions when a query's WHERE clause constrains the partition column
    • Predicate pushdown: reduce the amount of data flowing into later stages
    • Join strategies
      • Map join: for a large table joined with a small one, broadcast the small table (it must be specified before execution, since there is no data histogram to decide automatically; a hedged sketch follows this list)
      • Shuffle join: hash the rows of both tables on the join key and send matching rows to the same node to be joined
      • Sort-merge join: sort the data, cut it into ranges, and send the same range to the same node (the two sorted tables are either built in the background before the run or specified when the tables are created)
  • Executor: converts the DAG into MapReduce jobs
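As a hedged illustration of the map-join strategy (the table names are hypothetical), Hive's classic MAPJOIN hint asks the planner to broadcast the small table; newer versions can also convert such joins automatically when hive.auto.convert.join is enabled:

```sql
-- Broadcast the small dimension table so the join runs map-side
-- and avoids a shuffle; 'fact' and 'dim' are hypothetical tables.
SELECT /*+ MAPJOIN(dim) */ f.user_id, d.region
FROM fact f
JOIN dim d ON f.region_id = d.region_id;

-- Alternatively, let Hive convert small-table joins automatically.
SET hive.auto.convert.join=true;
```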

Hive characteristics

  • Hive's biggest feature is that it analyzes big data through SQL-like statements instead of hand-written MapReduce programs, which makes analysis much easier.
  • Hive maps data onto databases and tables; the metadata for them usually lives in a relational database (such as MySQL).
  • Hive does not provide its own data storage; data is usually stored on HDFS (data integrity and format are not strictly enforced).
  • Hive scales its storage and compute capacity easily, a property inherited from Hadoop (well suited to large-scale parallel computing).
  • Hive is designed for OLAP and does not support transactions.

Hive execution flow

A detailed walk-through of the execution process

Step 1: The UI (user interface) calls the executeQuery interface and sends the HQL query statement to the Driver.

Step 2: The Driver creates a session handle for the query and sends the statement to the Compiler, which parses it and generates an execution plan.

Steps 3 and 4: The Compiler fetches the relevant metadata from the metastore.

Step 5: The metadata is used to type-check the expressions in the query tree and to prune partitions based on the query predicates; the plan is then generated.

Step 6 (6.1, 6.2, 6.3): The execution plan produced by the Compiler is a DAG of stages; each stage may involve a Map/Reduce job, a metadata operation, or an HDFS file operation. The Execution Engine submits each stage of the DAG to the corresponding component for execution.

Steps 7, 8 and 9: In each task (mapper/reducer), the query results are written to temporary files on HDFS. The Execution Engine reads the temporary files holding the final results directly from HDFS and returns them as the response to the Driver's fetch API.
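To see the staged DAG that the Compiler produces, Hive's EXPLAIN statement prints the plan. A minimal sketch, reusing the hypothetical page_views table from above:

```sql
-- Prints the stage DAG: typically a map/reduce stage for the
-- aggregation followed by a fetch stage that reads the temporary
-- result files back from HDFS.
EXPLAIN
SELECT dt, count(*) FROM page_views GROUP BY dt;
```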

Fault tolerance (relies on Hadoop's fault tolerance)

  1. Hive's execution plan runs as MapReduce jobs; each job writes its intermediate result files to local disk, which gives individual jobs fault tolerance.
  2. The final output files are written to HDFS, relying on HDFS to guarantee the fault tolerance of the data.

Hive shortcomings

  1. MapReduce:
    1. After a map task finishes, its output must be written to disk
    2. After a MapReduce job finishes, intermediate results must be persisted to HDFS
    3. When the DAG is turned into MapReduce jobs, meaningless map tasks are generated
    4. Hadoop takes 5-10 seconds to start a MapReduce job, and a single query may need to start many jobs

SparkSQL

Architecturally, SparkSQL is similar to Hive; it simply replaces MapReduce with Spark at the bottom layer.

Besides replacing the underlying execution engine, SparkSQL adds optimizations in three areas:

  1. A memory-based columnar storage scheme
  2. Cost-based optimization of SQL statements
  3. Co-located data slices

Cost based optimization

SparkSQL collects statistics about the data's distribution: slice sizes, data histograms of hot data, and so on. With these statistics it can:

  1. Dynamically choose the operator type based on table size (join type, aggregate type)
  2. Decide the degree of parallelism based on table size (the number of nodes the DAG is split into)
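As a hedged sketch of how these statistics feed Spark SQL's cost-based optimizer (reusing the hypothetical page_views table from earlier), one enables CBO and collects table- and column-level statistics:

```sql
-- Enable the cost-based optimizer and statistics-driven join reordering.
SET spark.sql.cbo.enabled=true;
SET spark.sql.cbo.joinReorder.enabled=true;

-- Collect the row counts, sizes, and per-column statistics
-- (min/max, distinct counts) that the planner uses to pick
-- join strategies and parallelism.
ANALYZE TABLE page_views COMPUTE STATISTICS;
ANALYZE TABLE page_views COMPUTE STATISTICS FOR COLUMNS user_id, dt;
```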

Co-located data slices

At table-creation time you specify how the data is distributed, similar to GreenPlum's DISTRIBUTED BY clause. A join on the distribution key then does not need to move data across the network.
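As a hedged illustration (the table names are hypothetical), bucketing both sides of a join on the join key in Spark SQL co-locates matching rows, letting the join avoid a shuffle:

```sql
-- Both tables are bucketed on user_id into the same number of
-- buckets, so a join on user_id can read matching buckets
-- together instead of shuffling rows over the network.
CREATE TABLE users  (user_id BIGINT, name STRING)
USING parquet CLUSTERED BY (user_id) INTO 16 BUCKETS;

CREATE TABLE orders (user_id BIGINT, amount DOUBLE)
USING parquet CLUSTERED BY (user_id) INTO 16 BUCKETS;

SELECT u.name, sum(o.amount) AS total
FROM users u JOIN orders o ON u.user_id = o.user_id
GROUP BY u.name;
```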
