1. Introduction to Hive external tables for Hudi tables

A Hudi source table corresponds to one copy of data on HDFS. Using Spark, Flink, or the Hudi client, the data of a Hudi table can be mapped to a Hive external table. Through this external table, Hive can conveniently run real-time view, read-optimized view, and incremental view queries.

2. Integrating Hive with Hudi

Here we use Hive 3.1.1 and Hudi 0.9.0 as an example; other versions are similar.

  • Put hudi-hadoop-mr-bundle-0.9.0xxx.jar and hudi-hive-sync-bundle-0.9.0xx.jar in the lib directory of the HiveServer node.

  • Edit hive-site.xml, find the two configuration items hive.default.aux.jars.path and hive.aux.jars.path, and set both to the full paths of the jars from the first step.

  • Restart hive-server after the configuration is done.

  • For Hudi bootstrap tables (queried with Tez), in addition to hudi-hadoop-mr-bundle-0.9.0xxx.jar and hudi-hive-sync-bundle-0.9.0xx.jar, the following jars must also be added by the same steps: hbase-shaded-miscellaneous-xxx.jar, hbase-metric-api-xxx.jar, hbase-metrics-xxx.jar, hbase-protocol-shaded-xx.jar, hbase-shaded-protobuf-xxx.jar, htrace-core4-4.2.0xxxx.jar.

3. Creating the Hive external table for a Hudi table

Generally, when a Hudi table is written with Spark or Flink, it is automatically synchronized to a Hive external table, and the synchronized external table can then be queried directly through beeline. If the write engine does not have automatic synchronization enabled, you need to synchronize manually with the Hudi client tool run_hive_sync_tool.sh; see the official website for the relevant parameters.

4. Querying the Hive external table for a Hudi table

4.1 Prerequisites

Before querying a Hudi table with Hive, hive.input.format must be set with the set command; otherwise errors such as duplicate data and query exceptions occur. For example, the following error is a typical result of not setting hive.input.format:

java.lang.IllegalArgumentException: HoodieRealtimeReader can only work on RealTimeSplit and not with xxxxxxxxxx

In addition, incremental queries require three more parameters to be set with the set command:

set hoodie.mytableName.consume.mode=INCREMENTAL;
set hoodie.mytableName.consume.max.commits=3;
set hoodie.mytableName.consume.start.timestamp=commitTime;

Note that these three parameters are table-level parameters: the table name is part of each parameter name (mytableName is a placeholder).

Parameter name | Description
hoodie.mytableName.consume.mode | Query mode of the Hudi table. Incremental query: INCREMENTAL. Non-incremental query: not set, or set to SNAPSHOT.
hoodie.mytableName.consume.start.timestamp | Start time of the incremental query on the Hudi table.
hoodie.mytableName.consume.max.commits | Number of commits to query incrementally after hoodie.mytableName.consume.start.timestamp. For example, 3 means the incremental query returns the data of the 3 commits after the specified start time; -1 means it returns all data committed after the specified start time.
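
Because the table name is embedded in each parameter name, the three set statements can be generated from the table name. The following is a minimal Java sketch of that naming rule; the class and method names (HudiIncrementalParams, setStatements) are hypothetical helpers for illustration and are not part of Hudi or Hive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds the table-level "set" statements for an incremental query. */
final class HudiIncrementalParams {

    private HudiIncrementalParams() {
    }

    /**
     * @param table      Hive table name as synchronized from Hudi, e.g. "hudicow"
     * @param startTs    commit time to start from (value of consume.start.timestamp)
     * @param maxCommits number of commits to read after startTs, or -1 for all of them
     */
    static List<String> setStatements(String table, String startTs, int maxCommits) {
        // Every parameter shares the prefix hoodie.<table>.consume.
        String prefix = "hoodie." + table + ".consume.";
        List<String> statements = new ArrayList<>();
        statements.add("set " + prefix + "mode=INCREMENTAL;");
        statements.add("set " + prefix + "max.commits=" + maxCommits + ";");
        statements.add("set " + prefix + "start.timestamp=" + startTs + ";");
        return statements;
    }

    public static void main(String[] args) {
        // "xxxx" stands for a real commit time, as in the examples in this article.
        for (String statement : setStatements("hudicow", "xxxx", 3)) {
            System.out.println(statement);
        }
    }
}
```

The generated lines can be pasted into a beeline session before the incremental query itself.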

4.2 Querying COW-type Hudi tables

For example, the original Hudi table is named hudicow; after it is synchronized to Hive, the Hive table is also named hudicow.

4.2.1 COW table real-time view query

After setting hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat or org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, query the table like an ordinary Hive table:

set hive.input.format= org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from hudicow;

4.2.2 COW table incremental query

In addition to hive.input.format, the three incremental query parameters above must also be set, and the incremental query statement must include a where clause with `_hoodie_commit_time` > 'startCommitTime' as a filter. (The reason is that when Hudi merges small files, it combines data from old and new commits into new files, and Hive cannot tell directly from the parquet files which data is new and which is old.)

set hive.input.format = org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hoodie.hudicow.consume.mode = INCREMENTAL;
set hoodie.hudicow.consume.max.commits = 3;
set hoodie.hudicow.consume.start.timestamp = xxxx;
select count(*) from hudicow where `_hoodie_commit_time` > 'xxxx';

Note that the quotation marks around _hoodie_commit_time are backticks (the key above Tab), not single quotes, while 'xxxx' uses single quotes.

4.3 Querying MOR-type Hudi tables

For example, the MOR-type Hudi source table is named hudimor; it is mapped to two Hive external tables, hudimor_ro (the ro table) and hudimor_rt (the rt table).

4.3.1 MOR table read-optimized view

This actually reads the ro table. As with a COW table, set hive.input.format and then query it like an ordinary Hive table.

4.3.2 MOR table real-time view

After setting hive.input.format, you can query the latest data of the Hudi source table through the rt table:

set hive.input.format = org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from hudimor_rt;

4.3.3 MOR table incremental query

This incremental query applies to the rt table, not the ro table. It is similar to the incremental query on a COW table:

-- Here the input format is specified as HoodieCombineHiveInputFormat
set hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hudimor.consume.mode = INCREMENTAL;
set hoodie.hudimor.consume.max.commits = -1;
set hoodie.hudimor.consume.start.timestamp = xxxx;
-- The table queried is the rt table
select * from hudimor_rt where `_hoodie_commit_time` > 'xxxx';

Notes:

  • set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; is best used only for incremental queries on rt tables. Other kinds of queries can also use this setting, but it affects queries on ordinary Hive tables. So after the rt table incremental query is finished, run set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; or restore the default with set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; before querying other tables.

  • set hoodie.mytableName.consume.mode=INCREMENTAL; enables incremental query mode for this table only. To switch the table to another query mode, run set hoodie.mytableName.consume.mode=SNAPSHOT;
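
The ordering described in these notes (switch the session, run the incremental query, then restore the settings) can be sketched as an ordered list of statements. This is only an illustration in Java: RtIncrementalSession and statements are hypothetical names, the table is assumed to follow the <name>_rt convention used in this article, and the sketch chooses to restore Hive's default CombineHiveInputFormat at the end.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: one rt-table incremental query, followed by the reset statements. */
final class RtIncrementalSession {

    static final String HUDI_COMBINE_FORMAT =
            "org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat";
    static final String HIVE_DEFAULT_FORMAT =
            "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat";

    private RtIncrementalSession() {
    }

    /** Returns the statements in the order they should be run in a single session. */
    static List<String> statements(String table, String startTs) {
        String prefix = "hoodie." + table + ".consume.";
        return Arrays.asList(
                // 1. Switch the session to the Hudi input format and to incremental mode.
                "set hive.input.format=" + HUDI_COMBINE_FORMAT + ";",
                "set " + prefix + "mode=INCREMENTAL;",
                "set " + prefix + "max.commits=-1;",
                "set " + prefix + "start.timestamp=" + startTs + ";",
                // 2. Run the incremental query against the rt table.
                "select * from " + table + "_rt where `_hoodie_commit_time` > '" + startTs + "';",
                // 3. Restore the settings so later queries on other tables are unaffected.
                "set " + prefix + "mode=SNAPSHOT;",
                "set hive.input.format=" + HIVE_DEFAULT_FORMAT + ";");
    }

    public static void main(String[] args) {
        // "xxxx" stands for a real commit time.
        for (String statement : statements("hudimor", "xxxx")) {
            System.out.println(statement);
        }
    }
}
```

Keeping the reset statements in the same list as the query makes it harder to forget them when the session is reused for ordinary Hive tables.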

The current Hudi release (0.9.0) has some problems when integrating with Hive; please use the master branch or the upcoming 0.10.0 release.

  • When Hive reads a Hudi table, it prints all the data, which causes serious performance and data security problems.

  • For real-time view reads of MOR tables, set mapreduce.input.fileinputformat.split.maxsize as needed to prevent Hive from splitting the files it reads; otherwise data duplication occurs. There is currently no fix for this problem: when Spark reads the Hudi real-time view, the code itself ensures that files are not split, whereas Hive requires the manual setting.

  • If you encounter errors such as classNotFound or noSuchMethod, check whether the jars under the Hive lib directory conflict.

5. Hive-side source code modification

To support Hive queries on Hudi pure log files, the Hive source code needs to be modified.

Specifically, modify the following filter in org.apache.hadoop.hive.common.FileUtils:

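// FileUtils also needs imports of java.util.regex.Matcher and java.util.regex.Pattern.
public static final PathFilter HIDDEN_FILES_PATH_FILTER = new PathFilter() {
  public boolean accept(Path p) {
    String name = p.getName();
    // Names starting with ".hoodie" (Hudi metadata) must stay visible
    boolean isHudiMeta = name.startsWith(".hoodie");
    boolean isHudiLog = false;
    // Hudi log file names start with "." and would otherwise be treated as hidden
    Pattern LOG_FILE_PATTERN = Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
    Matcher matcher = LOG_FILE_PATTERN.matcher(name);
    if (matcher.find()) {
      isHudiLog = true;
    }
    boolean isHudiFile = isHudiLog || isHudiMeta;
    // Accept ordinary non-hidden files as before, plus Hudi metadata and log files
    return (!name.startsWith("_") && !name.startsWith(".")) || isHudiFile;
  }
};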


Recompile Hive and replace hive-common-xxx.jar and hive-exec-xxx.jar in the lib directory of the Hive server with the newly compiled jars. Make sure the permissions and names match the original jars.

Finally, restart hive-server.
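
Before rebuilding Hive, it can help to check how the modified filter treats a few file names. The following standalone Java sketch reuses the regular expression from the FileUtils change and applies it to plain strings instead of Hadoop Path objects; the class name HudiHiddenFileCheck and the sample log file name are made up for illustration.

```java
import java.util.regex.Pattern;

/** Standalone sketch of the name check used in the modified HIDDEN_FILES_PATH_FILTER. */
final class HudiHiddenFileCheck {

    // Same expression as in the FileUtils change, compiled once instead of per call.
    private static final Pattern LOG_FILE_PATTERN = Pattern.compile(
            "\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

    private HudiHiddenFileCheck() {
    }

    /** Returns true if a file with this name would be visible to Hive. */
    static boolean accept(String name) {
        boolean isHudiMeta = name.startsWith(".hoodie");
        boolean isHudiLog = LOG_FILE_PATTERN.matcher(name).find();
        boolean isHidden = name.startsWith("_") || name.startsWith(".");
        return !isHidden || isHudiMeta || isHudiLog;
    }

    public static void main(String[] args) {
        String[] names = {
            "part-00000.parquet",              // ordinary data file
            "_SUCCESS",                        // hidden marker file
            ".hoodie",                         // Hudi metadata
            ".abc_20210801120000.log.1_0-1-0"  // made-up name shaped to match the pattern
        };
        for (String name : names) {
            System.out.println(name + " -> " + accept(name));
        }
    }
}
```

Of these four sample names, only _SUCCESS is rejected; the other three are accepted, including the dot-prefixed log file name that the unmodified filter would hide.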

