from http://www.alidata.org/archives/581

The Hive query language is described in detail in the official documentation: http://wiki.apache.org/hadoop/Hive/LanguageManual. Most of this article is translated from that page, with notes added about issues encountered in practical use.

Create Table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses the exception.

The EXTERNAL keyword creates an external table, whose LOCATION clause points at the actual data. When Hive creates an internal (managed) table, it moves the data into the warehouse directory; when it creates an external table, it only records the path to the data and makes no change to the data's location. When a table is dropped, an internal table's metadata and data are deleted together, while dropping an external table deletes only the metadata and leaves the data in place.

LIKE lets users copy an existing table definition without copying its data.
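
For instance, a minimal sketch of the LIKE form (the table names here are made up for illustration):

CREATE TABLE page_view_copy LIKE page_view;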

Users can specify a custom SerDe when creating a table, or fall back on Hive's built-in SerDe: if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When creating a table you also specify its columns; when a custom SerDe is specified, Hive determines the table's column data through that SerDe.

If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

Partitioned tables are created with the PARTITIONED BY clause. A table can have one or more partitions, and each partition is stored in its own directory. Furthermore, both tables and partitions can be bucketed with CLUSTERED BY, which distributes rows into buckets by column value, and the rows within each bucket can be ordered with SORTED BY. Bucketing and sorting can improve performance for certain kinds of queries.
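
Putting these clauses together, here is a sketch of a partitioned, bucketed table, adapted from the classic page_view example in the Hive wiki (the bucket count and delimiter are illustrative):

CREATE TABLE page_view (
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    ip STRING COMMENT 'IP address of the user'
)
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;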

Table and column names are case-insensitive, while SerDe and property names are case-sensitive. Comments on tables and columns are strings.

Drop Table

Dropping an internal table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and keeps the data.

Alter Table

The ALTER TABLE statement lets the user change the structure of an existing table: add columns or partitions, change the SerDe, add table or SerDe properties, or rename the table itself.

Add Partitions

ALTER TABLE table_name ADD
  partition_spec [ LOCATION 'location1' ]
  partition_spec [ LOCATION 'location2' ] ...

partition_spec:
  : PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

Users can add partitions to a table with ALTER TABLE ... ADD PARTITION. Quote the partition value when it is a string.

ALTER TABLE page_view ADD
  PARTITION (dt='2008-08-08', country='us') LOCATION '/path/to/us/part080808'
  PARTITION (dt='2008-08-09', country='us') LOCATION '/path/to/us/part080809';

DROP PARTITION

ALTER TABLE table_name DROP partition_spec, partition_spec,... 

Users can drop a partition with ALTER TABLE ... DROP PARTITION. The metadata and data of the partition are deleted together.

ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us'); 

RENAME TABLE

ALTER TABLE table_name RENAME TO new_table_name 

This command renames a table. The location of the data and the partition names do not change. In other words, the old table's directory is not released: changes made to the data at the old location will show up in the renamed table.
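
A minimal sketch (the new table name is hypothetical):

ALTER TABLE page_view RENAME TO page_view_archive;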

Change Column Name/Type/Position/Comment

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name] 

This command lets the user modify a column's name, data type, comment, or position.

For example:

CREATE TABLE test_change (a int, b int, c int);

ALTER TABLE test_change CHANGE a a1 INT;  -- renames column a to a1.

ALTER TABLE test_change CHANGE a a1 STRING AFTER b;  -- renames column a to a1, changes its type to STRING, and places it after column b. The new table structure is: b int, a1 string, c int.

ALTER TABLE test_change CHANGE b b1 INT FIRST;  -- renames column b to b1 and moves it to the first position. The new table structure is: b1 int, a string, c int.

Note: column changes modify only Hive's metadata; the actual data is not changed. Users should make sure the metadata definition stays consistent with the actual data layout.

Add/Replace Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...) 

ADD COLUMNS appends new columns after the existing columns but before the partition columns.

REPLACE COLUMNS removes the existing columns and replaces them with the new set of columns. This is only possible for tables that use a native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe).
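
As a sketch against the test_change table from above (the added column is hypothetical):

ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'a new trailing column');
ALTER TABLE test_change REPLACE COLUMNS (a INT, b INT);

After the REPLACE, the table's metadata describes only columns a and b; as noted above, the files on disk are untouched.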

Alter Table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties

table_properties:
  : (property_name = property_value, property_name = property_value, ... )

The user can attach metadata to a table with this command. Currently the last_modified_user and last_modified_time properties are managed automatically by Hive. Users can add their own properties to the list and read them back with DESCRIBE EXTENDED table_name.
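
For example, a sketch with a made-up property name and value:

ALTER TABLE page_view SET TBLPROPERTIES ('owner' = 'analytics-team');
DESCRIBE EXTENDED page_view;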

Add Serde Properties

ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]

ALTER TABLE table_name SET SERDEPROPERTIES serde_properties

serde_properties:
  : (property_name = property_value, property_name = property_value, ... )

This command lets the user add user-defined metadata to a table's SerDe object. Hive initializes the SerDe properties and passes them to the table's SerDe when it serializes and deserializes data, so a custom SerDe can store its own configuration as properties.
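
A minimal sketch, assuming the table uses a delimited-text SerDe that reads the field.delim property (treat the exact property name as an assumption for your SerDe):

ALTER TABLE page_view SET SERDEPROPERTIES ('field.delim' = '\t');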

Alter Table File Format and Organization

ALTER TABLE table_name SET FILEFORMAT file_format

ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS

These commands modify the table's physical storage properties.
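
For example (the bucket count and columns are illustrative; note that these statements change only Hive's metadata and do not reorganize existing data, so the files on disk must already match, or be rewritten to match, the new layout):

ALTER TABLE page_view SET FILEFORMAT SEQUENCEFILE;
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS;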

Loading files into table

When data is loaded into a table, no transformation is applied to the data. A Load operation simply copies or moves data files into the location corresponding to the Hive table.

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] 

Synopsis:

A Load operation is a plain copy/move that places data files into the location corresponding to the Hive table or partition.

  • filepath can be:
    • a relative path, e.g. project/data1
    • an absolute path, e.g. /user/hive/project/data1
    • a complete URI with scheme (and optionally authority), e.g. hdfs://namenode:9000/user/hive/project/data1
  • The load target can be a table or a partition. If the table is partitioned, you must specify the target partition by giving a value for every partition column.
  • filepath can refer to a single file (in which case Hive moves that file into the table's directory) or to a directory (in which case Hive moves every file in that directory into the table's directory).
  • If LOCAL is specified:
    • the load command looks for filepath in the local file system. A relative path is interpreted relative to the user's current working directory. Users can also specify a complete URI for a local file, e.g. file:///user/hive/project/data1.
    • the load command copies the files at filepath into the target file system (determined by the table's location attribute), and the copied files are then moved into the table's data location.
  • If LOCAL is not specified and filepath is a complete URI, Hive uses that URI directly. Otherwise:
    • if no scheme or authority is given, Hive uses the scheme and authority from the Hadoop configuration, where fs.default.name specifies the Namenode URI.
    • if the path is not absolute, Hive interprets it relative to /user/.
    • Hive moves the files designated by filepath into the path of the table (or partition).
  • If the OVERWRITE keyword is used, the existing contents of the target table (or partition), if any, are deleted and replaced with the files at filepath; otherwise the files at filepath are added to the table or partition.
  • If the target table (or partition) already contains a file whose name collides with a file name in filepath, the existing file is replaced by the new one.
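
Tying the pieces together, a sketch of loading a local file into a specific partition (the local path and partition values are made up):

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
OVERWRITE INTO TABLE page_view PARTITION (dt='2008-06-08', country='us');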

SELECT

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
  • A SELECT statement can be part of a union query or a subquery.
  • table_reference is the input to the query; it can be a regular table, a view, a join, or a subquery.
  • Simple query example: the following statement retrieves all columns from table t1.
SELECT * FROM t1 

WHERE Clause

The where condition is a boolean expression. For example, the following query returns only the sales records with an amount greater than 10 that belong to sales representatives in the US. Hive does not support IN, EXISTS, or subqueries in the WHERE clause.

SELECT * FROM sales WHERE amount > 10 AND region = "US" 

ALL and DISTINCT Clauses

The ALL and DISTINCT options control how duplicate rows are handled. The default is ALL, which returns all matching rows; DISTINCT removes duplicate rows from the result set.

hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5

hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5

hive> SELECT DISTINCT col1 FROM t1
1
2

Partition-Based Queries

In general a SELECT query scans the entire table (apart from sampling queries). If a table was created with the PARTITIONED BY clause, however, a query can take advantage of partition pruning (input pruning) and scan only the partitions it needs. Hive's current implementation applies partition pruning only if the partition predicates appear in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned on the date column, the following statement reads only the partitions between '2008-03-01' and '2008-03-31'.

SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'; 

HAVING Clause

Hive does not currently support the HAVING clause. A HAVING clause can be rewritten as a subquery, for example:

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10 

can be expressed with the following query:

SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10 

LIMIT Clause

LIMIT restricts the number of rows returned. The rows returned are chosen at random. The following query returns 5 arbitrary rows from table t1:

SELECT * FROM t1 LIMIT 5 

Top-k queries. The following statement returns the 5 sales representatives with the largest sales amounts (setting mapred.reduce.tasks to 1 forces a single reducer, so SORT BY yields a total order):

SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5

REGEX Column Specification

A SELECT statement can use a regular expression to choose columns. The following statement retrieves all columns except ds and hr:

SELECT `(ds|hr)?+.+` FROM sales 

Join

Syntax

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression

Hive supports only equality joins, outer joins, and left semi joins. Hive does not support non-equality join conditions, because they are very hard to express as map/reduce jobs. Hive does support joining more than two tables.

When writing join queries, several key points need attention:
1. Only equality joins are supported. For example:

SELECT a.* FROM a JOIN b ON (a.id = b.id)

SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

are both correct, whereas:

SELECT a.* FROM a JOIN b ON (a.id <> b.id)

is an error.

2. More than two tables can be joined. For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) 

If each JOIN clause uses the same join key from the shared table, the join is compiled into a single map/reduce job. For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) 

is compiled into a single map/reduce job, because the joins use only b.key1 as the join key.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) 

is compiled into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.

In each map/reduce job of a join, the reducer buffers the records of every table in the join sequence except the last one, and streams the join results to the file system while iterating over the last table. This implementation reduces memory usage on the reduce side. In practice, the largest table should therefore be written last (otherwise a lot of memory is wasted on buffering). For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) 

joins all tables on the same key (computed as a single map/reduce job). The reduce side buffers the records of tables a and b, then computes the join result once for every record fetched from table c. Similarly:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) 

runs as two map/reduce jobs: the first buffers the records of a and streams them through b; the second buffers the result of the first job and streams it through c.

The LEFT, RIGHT, and FULL OUTER keywords control how unmatched records are handled in a join. For example:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key) 

outputs one row for every record in table a. The output is a.val, b.val when a matching b.key exists for a.key, and a.val, NULL when it does not; rows of b with no matching a.key are dropped. The clause "FROM a LEFT OUTER JOIN b" should be read as written on one line: a is to the LEFT of b, so all of a's records are kept, while "a RIGHT OUTER JOIN b" keeps all of b's records instead. OUTER JOIN semantics should follow the standard SQL spec.

Joins occur before the WHERE clause is applied. To limit a join's output, write the filter in the WHERE clause, or in the join clause itself. A confusing case arises with partitioned tables:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key) WHERE a.ds='2009-07-07' AND b.ds='2009-07-07' 

joins a with b (OUTER JOIN) and lists a.val and b.val. Other columns can be used as filters in the WHERE clause. However, as described above, when a row of a has no matching row in b, all of b's columns come back NULL, including the ds column. In other words, the WHERE predicate on b.ds filters out every row of a whose join key found no match in b, so the LEFT OUTER semantics no longer affect the query result. The solution is to move the partition predicates into the ON clause of the OUTER JOIN:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07') 

With this form the filtering happens in the join stage, before unmatched rows are padded with NULLs, so the problem disappears. The same logic applies to RIGHT and FULL joins.

Joins are not commutative: whether LEFT or RIGHT, joins are always evaluated left-associatively.

SELECT a.val1, a.val2, b.val, c.val FROM a JOIN b ON (a.key = b.key) LEFT OUTER JOIN c ON (a.key = c.key) 

first joins a with b, throwing away every record of a or b that has no matching key in the other table, and then joins the reduced result with c. This leads to an unintuitive result when a key exists in both a and c but not in b: the whole row (including a.val1, a.val2, and a.key) is dropped in the a JOIN b step, so the matching record of c finds nothing to join with and is not output either (had the last join been a RIGHT OUTER JOIN, that record would instead surface as NULL, NULL, NULL, c.val).

LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery pattern. Hive does not currently implement IN/EXISTS subqueries, so such statements can be rewritten with LEFT SEMI JOIN. The restriction is that the right-hand table may only be referenced in the ON clause; referencing it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.

SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);

can be rewritten as:

SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key)
