hive.optimize.cp=true: column pruning
hive.optimize.pruner: partition pruning
hive.limit.optimize.enable=true: optimize LIMIT n statements
hive.limit.optimize.limit.file=10: maximum number of files

1. Local mode (small jobs):
All of the following conditions must be met:
1. The job's input data size must be less than the configured threshold (default 128 MB)
2. The job's number of map tasks must be less than the configured threshold (default 4)
3. The job's number of reduce tasks must be 0 or 1
hive.mapred.local.mem: JVM memory size used in local mode
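
The local-mode conditions above might be expressed as session settings as in the sketch below. The text does not name the threshold parameters, so the names used here are assumptions based on commonly documented Hive settings. Check them against the configuration reference for your Hive version.

```sql
-- Let Hive decide automatically whether a job is small enough to run locally.
SET hive.exec.mode.local.auto=true;
-- Assumed name of the input-size threshold (128 MB, the default cited above).
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Assumed name of the map-task threshold (4, the default cited above);
-- some Hive versions use hive.exec.mode.local.auto.input.files.max instead.
SET hive.exec.mode.local.auto.tasks.max=4;
```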

2. Parallel execution:
hive.exec.parallel=true (the default is false)

3. Strict mode:
hive.mapred.mode=strict. Strict mode does not allow the following queries to be executed:
A query on a partitioned table that does not specify a partition
An ORDER BY without a LIMIT clause
A Cartesian product: a JOIN without an ON clause
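
The three restrictions can be illustrated with the sketch below. The tables logs (assumed to be partitioned by dt) and users, and all column names, are hypothetical and do not come from the original. The rejected forms are left as comments.

```sql
SET hive.mapred.mode=strict;

-- Rejected: logs is partitioned by dt, but no partition is specified.
-- SELECT * FROM logs WHERE status = 500;
-- Accepted: the partition column is filtered.
SELECT * FROM logs WHERE dt = '2020-01-01' AND status = 500;

-- Rejected: ORDER BY without LIMIT.
-- SELECT userid FROM logs WHERE dt = '2020-01-01' ORDER BY userid;
-- Accepted: ORDER BY with LIMIT.
SELECT userid FROM logs WHERE dt = '2020-01-01' ORDER BY userid LIMIT 100;

-- Rejected: JOIN without ON (Cartesian product).
-- SELECT * FROM logs JOIN users;
-- Accepted: JOIN with an ON clause.
SELECT l.userid, u.name
FROM logs l JOIN users u ON l.userid = u.userid
WHERE l.dt = '2020-01-01';
```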

4. Dynamic partitioning:
hive.exec.dynamic.partition.mode=strict: in this mode, at least one static partition must be specified
hive.exec.max.dynamic.partitions.pernode=100: the maximum number of partitions that each mapper/reducer node is allowed to create
DataNode: dfs.datanode.max.xcievers=8192: the number of files a DataNode is allowed to have open
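
A minimal sketch of a dynamic-partition insert under strict mode follows. The tables sales_part and sales_raw and their columns are hypothetical. hive.exec.dynamic.partition is the switch that enables the feature and is not mentioned in the text above.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=strict;
SET hive.exec.max.dynamic.partitions.pernode=100;

-- country is static (required in strict mode); state is resolved dynamically
-- from the last column of the SELECT list.
INSERT OVERWRITE TABLE sales_part PARTITION (country = 'US', state)
SELECT s.order_id, s.amount, s.state
FROM sales_raw s
WHERE s.country = 'US';
```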

5. Speculative execution:

6. Single MapReduce job for multiple GROUP BY:
hive.multigroupby.singlemr=true: when multiple GROUP BY statements share the same grouping columns, they are optimized into a single MR job

7. hive.exec.rowoffset: whether to provide the row-offset virtual column

8. Grouping
Two aggregate functions cannot use different DISTINCT columns. The following statement is invalid:
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender,
count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) FROM
pv_users GROUP BY pv_users.gender;
The SELECT list may contain only GROUP BY columns or aggregate functions.
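
For contrast, a form that stays within the restriction might look like the sketch below, where every DISTINCT aggregate uses the same column. It assumes pv_gender_agg has four columns matching the SELECT list, which the original does not state.

```sql
-- Both DISTINCT aggregates use the same column (userid), so the restriction is met.
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender,
       count(DISTINCT pv_users.userid),
       count(*),
       sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
```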

9. hive.map.aggr=true: performs partial aggregation on the map side, which is more efficient but requires more memory.
hive.groupby.mapaggr.checkinterval: the number of rows over which map-side aggregation is performed

hive.groupby.skewindata=true: load balancing when the data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first job,
the map output is distributed randomly among the reducers; each reducer performs a partial aggregation and emits its result. As a consequence, rows with the same GROUP BY key
may be sent to different reducers, which balances the load. The second job then distributes the pre-aggregated results to the reducers by GROUP BY key
(which guarantees that the same GROUP BY key goes to the same reducer) and completes the final aggregation.
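
As a sketch, the two settings are typically enabled per session before a GROUP BY on a skewed key. The table page_views and its columns are hypothetical.

```sql
SET hive.map.aggr=true;            -- partial aggregation on the map side
SET hive.groupby.skewindata=true;  -- two-job plan for skewed GROUP BY keys

-- A few very popular page_url values would otherwise overload single reducers.
SELECT page_url, count(*) AS views
FROM page_views
GROUP BY page_url;
```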

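11. Multi-Group-By Inserts:
-- count1 and count2 are placeholder target tables; the INSERT clauses are missing from the original
FROM test
INSERT OVERWRITE TABLE count1
SELECT count(DISTINCT test.dqcode)
GROUP BY test.zipcode
INSERT OVERWRITE TABLE count2
SELECT count(DISTINCT test.dqcode)
GROUP BY test.sfcode;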

12. Sorting
hive.mapred.mode=strict: ORDER BY must be accompanied by a LIMIT clause
hive.mapred.mode=nonstrict: ORDER BY uses a single reducer to complete the sort
SORT BY colName ASC/DESC: sorts within each reducer
DISTRIBUTE BY (used in subqueries): controls which reducer a particular row goes to; it does not guarantee the order of the data within a reducer
CLUSTER BY: shorthand for when SORT BY and DISTRIBUTE BY use the same column.
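
The difference between these clauses can be sketched as follows. The table clicks and its columns are hypothetical.

```sql
-- Rows with the same userid go to the same reducer and are sorted by ts within it.
SELECT userid, ts, url
FROM clicks
DISTRIBUTE BY userid
SORT BY userid, ts;

-- CLUSTER BY userid is shorthand for DISTRIBUTE BY userid SORT BY userid.
SELECT userid, ts, url
FROM clicks
CLUSTER BY userid;
```

Neither query produces a total order across reducers. Only ORDER BY does that, at the cost of a single reducer.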

13. Merging small files
hive.merge.mapfiles=true: merge map output
hive.merge.mapredfiles=false: merge reduce output
hive.merge.size.per.task=256*1000*1000: the size of the merged files
hive.mergejob.maponly=true: if CombineHiveInputFormat is supported, the merge runs as a map-only job
hive.merge.smallfiles.avgsize=16000000: when the average output file size is less than this value, an MR job is started to perform the merge.
Reducing the number of maps:
set mapred.max.split.size
set mapred.min.split.size
set mapred.min.split.size.per.node
set mapred.min.split.size.per.rack
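
The original lists these settings without values. The sketch below shows one way they might be combined with CombineHiveInputFormat so that small files are packed into fewer splits. All byte values are illustrative assumptions, not recommendations from the source.

```sql
-- Combine small input files into larger splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound on the size of a combined split (illustrative value).
SET mapred.max.split.size=256000000;
-- Minimum split sizes when combining per node and per rack (illustrative values).
SET mapred.min.split.size.per.node=100000000;
SET mapred.min.split.size.per.rack=100000000;
```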
Increasing the number of maps:
When the input files are large, the task logic is complex, and the map tasks run very slowly, consider increasing the number of maps so that each map processes less data, which improves execution efficiency.
Suppose there is a task like this:
select data_desc, count(1), count(distinct id),sum(case when …),sum(case when ...),sum(…) from a group by data_desc
If table a has only one file of 120 MB that nevertheless contains tens of millions of records, completing this job with a single map task is bound to be slow. In this case, consider splitting the file into several pieces so that multiple map tasks can share the work.
set mapred.reduce.tasks=10;
create table a_1 as select * from a distribute by rand(123);
This randomly distributes the records of table a into table a_1, which consists of 10 files. Using a_1 in place of a in the SQL above then runs 10 map tasks. Each map task processes a little over 12 MB (several million records), which should be considerably more efficient.

Setting the number of reducers:
Parameter 1: hive.exec.reducers.bytes.per.reducer=1G: the amount of data processed by each reduce task
Parameter 2: hive.exec.reducers.max=999 (0.95 * number of TaskTrackers): the maximum number of reducers per job
Number of reducers = min(parameter 2, total input data size / parameter 1)
set mapred.reduce.tasks: the default number of reduce tasks per job. A typical value is 0.99 * the number of reduce slots; Hive sets it to -1 so that the number of reducers is determined automatically.
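
A worked instance of the formula, as a sketch: the 10 GB input size is a hypothetical figure, and the query reuses table a and column data_desc from the earlier example.

```sql
-- Assume the query reads about 10 GB of input (hypothetical figure).
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- parameter 1: about 1 GB per reducer
SET hive.exec.reducers.max=999;                       -- parameter 2
SET mapred.reduce.tasks=-1;                           -- let Hive estimate the count

-- Estimated reducers = min(999, 10 GB / 1 GB) = min(999, 10) = 10
SELECT data_desc, count(1)
FROM a
GROUP BY data_desc;
```

Setting mapred.reduce.tasks to a positive number instead would override the estimate.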

15. Using indexes:
hive.optimize.index.filter: use indexes automatically
hive.optimize.index.groupby: use aggregate indexes to optimize GROUP BY operations
