Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides a complete SQL query capability, and converts SQL-like statements into MapReduce jobs for execution.
Data organization format
The following describes how data stored directly on HDFS is organized:
- Table: each table is stored under a directory on HDFS.
- Partition (optional): each partition is stored as a subdirectory of the table's directory.
- Bucket (optional): within a partition, rows are distributed into buckets by hash value; each bucket is a single file.
Users can specify the partition scheme and the bucketing scheme, so that some partitions can be skipped entirely during execution. Hive specifies the partition scheme first and then applies a hash function within each partition; GreenPlum works the other way around, specifying the hash distribution first and then a partition scheme within each hash slice.
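As a sketch of the layout above, a HiveQL table definition can declare both a partition scheme and a bucketing scheme. The table and column names below are hypothetical:

```sql
-- Each partition (dt) becomes a subdirectory under the table's HDFS
-- directory; within a partition, rows are hashed on user_id into
-- 32 bucket files.
CREATE TABLE page_views (
    user_id   BIGINT,
    url       STRING,
    referrer  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- A predicate on the partition column lets Hive skip every other
-- partition directory entirely (partition pruning).
SELECT url, COUNT(*)
FROM page_views
WHERE dt = '2020-01-01'
GROUP BY url;
```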
As the architecture figure shows, HDFS and MapReduce are the foundations of the Hive architecture.
- MetaStore: stores and manages Hive metadata, using a relational database to hold the metadata.
- Interpreter and compiler: parse the SQL statement into a syntax tree, then generate a DAG that becomes the logical plan.
- Optimizer: provides only rule-based optimization:
  - Column pruning: read only the projected columns.
  - Partition pruning: skip partitions that the WHERE clause rules out.
  - Predicate pushdown: reduce the amount of data flowing to later operators.
- Join strategies:
  - Map join: one large table joined with one small table; the small table is broadcast (it must be requested before execution, since there is no data histogram to decide automatically).
  - Shuffle join: rows from both tables are shuffled to join nodes according to a hash function on the join key.
  - Sort-merge join: sort the data, split it into ranges, and send the same range of both tables to the same node (requires building two sorted copies in the background before running, or specifying sorting when the tables are created).
- Executor: converts the DAG into MapReduce jobs.
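The join strategies above are selected through session settings and query hints. The properties below are standard Hive settings; the table names are hypothetical:

```sql
-- Map join: let Hive broadcast small tables automatically. With no
-- histogram available, the choice is based on a file-size threshold.
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;

-- Or force a map join with a hint:
SELECT /*+ MAPJOIN(d) */ f.id, d.name
FROM fact f JOIN dim d ON f.dim_id = d.id;

-- Sort-merge bucket join: both tables must be bucketed and sorted
-- on the join key at table-creation time.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```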
- Hive's biggest feature is that it lets users analyze big data with SQL-like statements instead of hand-written MapReduce programs, which makes analysis much easier.
- Hive maps data to databases and tables; the metadata for databases and tables is usually kept in a relational database (such as MySQL).
- Hive does not provide data storage itself; data is usually stored on HDFS (data completeness and format are not strictly enforced).
- Hive scales its storage and compute capacity easily, a property inherited from Hadoop (well suited to large-scale parallel computing).
- Hive is designed for OLAP and does not support transactions.
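To illustrate the first point: word count, which takes dozens of lines as a hand-written MapReduce program, is a single HiveQL statement. The table and column names here are hypothetical, assuming `docs(line STRING)` holds one line of text per row:

```sql
-- Split each line into words, then count occurrences of each word.
-- The compiler turns this into a MapReduce job automatically.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) t AS word
GROUP BY word
ORDER BY cnt DESC;
```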
Hive execution flow
A detailed walkthrough of the execution process:
Step 1: The UI (user interface) calls the executeQuery interface and sends the HQL query to the Driver.
Step 2: The Driver creates a session handle for the query and sends it to the Compiler, which parses the statement and generates an execution plan.
Steps 3 and 4: The Compiler fetches the relevant metadata from the Metastore.
Step 5: The metadata is used to type-check the expressions in the query tree and to prune partitions based on the query predicates; the plan is then generated.
Step 6 (6.1, 6.2, 6.3): The execution plan generated by the Compiler is a DAG of stages; each stage may involve a Map/Reduce job, a metadata operation, or an HDFS file operation. The Execution Engine submits each stage of the DAG to the corresponding component for execution.
Steps 7, 8 and 9: In each task (mapper/reducer), the query results are written to a temporary file on HDFS. The Execution Engine reads this temporary file directly from HDFS and returns its contents as the result of the Driver's fetch API.
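The staged DAG from step 6 can be inspected with EXPLAIN; a grouped query typically compiles into one or more map/reduce stages plus a fetch stage. The query below is illustrative:

```sql
EXPLAIN
SELECT dt, COUNT(DISTINCT user_id)
FROM page_views
GROUP BY dt;
-- The output lists STAGE DEPENDENCIES and STAGE PLANS: here, a
-- map/reduce stage performing the aggregation, followed by a Fetch
-- stage that reads the temporary result file back to the client.
```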
Fault tolerance (relies on Hadoop's fault-tolerance mechanisms)
- Hive's execution plan runs as MapReduce jobs; each job writes its intermediate result files to local disk, which makes individual jobs fault-tolerant.
- The final output files are written to the HDFS file system, whose replication guarantees the fault tolerance of the data.
This fault tolerance comes with performance costs:
- After every Map task, output is written to disk.
- After every MapReduce job, intermediate results must be persisted to HDFS.
- Translating the DAG into MapReduce jobs can produce meaningless Map tasks.
- Launching a Hadoop MapReduce job takes 5-10 seconds, and a single query typically launches many such jobs.
SparkSQL is architecturally similar to Hive; it simply replaces MapReduce at the bottom layer with Spark.
Besides replacing the underlying execution engine, SparkSQL adds optimizations in three areas:
- An in-memory columnar storage scheme
- Cost-based optimization of SQL statements
- Data co-location
Cost-based optimization
SparkSQL collects statistics on the data: the distribution of values, partition sizes, histograms of hot data, and so on. With these statistics it can:
- Dynamically choose operator implementations based on table size (join type, aggregate type)
- Determine the degree of parallelism based on table size (the number of nodes a DAG stage is split into)
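The statistics that drive these choices usually have to be collected explicitly; in Spark SQL (as in Hive) this is done with ANALYZE TABLE. The table name is hypothetical:

```sql
-- Table-level statistics: row count and size in bytes.
ANALYZE TABLE page_views COMPUTE STATISTICS;

-- Column-level statistics (min/max, distinct counts, histograms)
-- let the cost-based optimizer pick join strategies and parallelism.
ANALYZE TABLE page_views COMPUTE STATISTICS FOR COLUMNS user_id, dt;
```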
Data co-location
When creating a table, the user specifies how the data is distributed, similar to specifying DISTRIBUTED BY in GreenPlum. A join between co-located tables then avoids shuffling data over the network.
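In Spark SQL, co-location is expressed by bucketing both sides of a join on the join key when the tables are created; matching buckets can then be joined locally without a network shuffle. The table names are hypothetical:

```sql
-- Both tables bucketed identically on the join key.
CREATE TABLE orders   (order_id BIGINT, user_id BIGINT)
    CLUSTERED BY (user_id) INTO 64 BUCKETS;
CREATE TABLE profiles (user_id BIGINT, country STRING)
    CLUSTERED BY (user_id) INTO 64 BUCKETS;

-- Corresponding bucket files can be joined node-locally.
SELECT o.order_id, p.country
FROM orders o JOIN profiles p ON o.user_id = p.user_id;
```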