1. Introduction: Hive external tables for Hudi tables

A Hudi source table corresponds to a copy of data on HDFS. Through Spark, Flink, or the Hudi client, the data of a Hudi table can be mapped to a Hive external table. Based on this external table, Hive can conveniently run real-time view, read-optimized view, and incremental view queries.

2. Integrating Hive with Hudi

Here we take Hive 3.1.1 and Hudi 0.9.0 as an example; other versions are similar.

  • Put hudi-hadoop-mr-bundle-0.9.0xxx.jar and hudi-hive-sync-bundle-0.9.0xx.jar into the lib directory of the hiveserver node.

  • Modify hive-site.xml: find the hive.default.aux.jars.path and hive.aux.jars.path configuration items and configure them with the full paths of the jar packages from the first step.

  • Restart hive-server after configuration.

  • For Hudi bootstrap tables (queried with Tez), in addition to hudi-hadoop-mr-bundle-0.9.0xxx.jar and hudi-hive-sync-bundle-0.9.0xx.jar, you also need to add hbase-shaded-miscellaneous-xxx.jar, hbase-metric-api-xxx.jar, hbase-metrics-xxx.jar, hbase-protocol-shaded-xx.jar, hbase-shaded-protobuf-xxx.jar, and htrace-core4-4.2.0xxxx.jar, following the same steps as above.
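For reference, the hive-site.xml entries from the second step might look like the following sketch. The /opt/hive/lib path and the exact jar file names are assumptions here; substitute the real full paths of the jars on your hiveserver node:

```xml
<property>
  <name>hive.default.aux.jars.path</name>
  <value>file:///opt/hive/lib/hudi-hadoop-mr-bundle-0.9.0.jar,file:///opt/hive/lib/hudi-hive-sync-bundle-0.9.0.jar</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/hive/lib/hudi-hadoop-mr-bundle-0.9.0.jar,file:///opt/hive/lib/hudi-hive-sync-bundle-0.9.0.jar</value>
</property>
```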

3. Creating the Hive external table for a Hudi table

Generally, when data is written to a Hudi table with Spark or Flink, it is automatically synchronized to a Hive external table, and the synchronized external table can be queried directly through beeline. If the write engine does not enable automatic synchronization, you need to run the Hudi client tool run_hive_sync_tool.sh manually to synchronize; refer to the official website for the relevant parameters.
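When manual synchronization is needed, the invocation is roughly as follows. This is only a sketch: the JDBC URL, credentials, base path, table name, and partition column are all placeholders, and the exact option names should be verified against the parameter list on the official website mentioned above:

```
# Sketch only: every value below is a placeholder, not a real environment.
./run_hive_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --base-path hdfs:///data/hudi/hudicow \
  --database default \
  --table hudicow \
  --partitioned-by dt
```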

4. Querying the Hive external table of a Hudi table

4.1 Prerequisites

Before querying a Hudi table with Hive, you need to set hive.input.format via the set command; otherwise errors such as duplicated data or query exceptions will occur. For example, the following error message is a typical result of not setting hive.input.format:

java.lang.IllegalArgumentException: HoodieRealtimeReader can only work on RealTimeSplit and not with xxxxxxxxxx

In addition, for incremental queries, three more parameters need to be set via the set command:

set hoodie.mytableName.consume.mode=INCREMENTAL;
set hoodie.mytableName.consume.max.commits=3;
set hoodie.mytableName.consume.start.timestamp=commitTime;

Note that these three parameters are table-level parameters:

Parameter name: description

hoodie.mytableName.consume.mode: the query mode of the Hudi table. For incremental queries set it to INCREMENTAL; for non-incremental queries leave it unset or set it to SNAPSHOT.
hoodie.mytableName.consume.start.timestamp: the start time of the incremental query on the Hudi table.
hoodie.mytableName.consume.max.commits: the number of commits to query incrementally after hoodie.mytableName.consume.start.timestamp. If set to 3, the incremental query returns the data of the 3 commits after the specified start time; if set to -1, it returns all data committed after the specified start time.

4.2 Querying a COW-type Hudi table

For example, the original Hudi table is named hudicow; after syncing to Hive, the Hive table name is also hudicow.

4.2.1 COW table real-time view query

After setting hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat or org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, the table can be queried just like an ordinary Hive table:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from hudicow;

4.2.2 COW table incremental query

In addition to setting hive.input.format, you also need to set the three incremental-query parameters above, and the incremental query statement must include a where clause with _hoodie_commit_time > 'startCommitTime' as a filter. (This is mainly because Hudi's small-file merging combines data from old and new commits into new files, so Hive cannot tell from the parquet files alone which data is new and which is old.)

set hive.input.format = org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hoodie.hudicow.consume.mode = INCREMENTAL;
set hoodie.hudicow.consume.max.commits = 3;
set hoodie.hudicow.consume.start.timestamp = xxxx;
select count(*) from hudicow where `_hoodie_commit_time` > 'xxxx';

Note that the quotation marks around _hoodie_commit_time are backquotes (the key above Tab), not single quotes, while 'xxxx' is in single quotes.

4.3 Querying a MOR-type Hudi table

For example, the MOR-type Hudi source table is named hudimor and is mapped to two Hive external tables: hudimor_ro (the ro table) and hudimor_rt (the rt table).

4.3.1 MOR table read-optimized view

This actually reads the ro table. As with the COW table, after setting hive.input.format it can be queried like an ordinary Hive table.

4.3.2 MOR table real-time view

After setting hive.input.format, the latest data of the Hudi source table can be queried:

set hive.input.format = org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from hudimor_rt;

4.3.3 MOR table incremental query

This incremental query targets the rt table, not the ro table. It works much like the incremental query of a COW table:

set hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; -- note: here the format must be HoodieCombineHiveInputFormat
set hoodie.hudimor.consume.mode = INCREMENTAL;
set hoodie.hudimor.consume.max.commits = -1;
set hoodie.hudimor.consume.start.timestamp = xxxx;
select * from hudimor_rt where `_hoodie_commit_time` > 'xxxx'; -- note: the query targets the rt table

Explanation:

  • set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; is best used only for incremental queries of the rt table, although other query types can also be run with it. Because this parameter affects queries on ordinary Hive tables, after the rt table incremental query is finished you should run set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; or restore the default set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; before querying other tables.

  • set hoodie.mytableName.consume.mode=INCREMENTAL; sets incremental query mode for that table only. To switch the table to another query mode, run set hoodie.hudisourcetablename.consume.mode=SNAPSHOT;

Some known issues currently exist in the Hudi (0.9.0) integration with Hive; please use the master branch or the upcoming 0.10.0 release:

  • When Hive reads a Hudi table it may read out all of the data, which poses serious performance and data-safety problems.

  • For real-time view reads of a MOR table, set mapreduce.input.fileinputformat.split.maxsize as needed to stop Hive from splitting the files it reads; otherwise duplicate data will appear. There is currently no automatic fix for this: when Spark reads a Hudi real-time view the code avoids splitting the files directly, but with Hive it must be set manually.

  • If you encounter classNotFound or noSuchMethod errors, check whether jar packages under the Hive lib directory conflict.
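As an illustration of the second point, one possible way to keep Hive from splitting files in a MOR real-time view query is sketched below. The 1 GB value is an assumption; pick a value larger than your largest data file:

```
-- Sketch: a split.maxsize larger than any single data file prevents splitting.
set mapreduce.input.fileinputformat.split.maxsize=1073741824;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from hudimor_rt;
```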

5. Hive-side source code modification

To support Hive queries on Hudi file groups that consist only of log files, the Hive-side source code needs to be modified.

Specifically, modify the following function in org.apache.hadoop.hive.common.FileUtils:

public static final PathFilter HIDDEN_FILES_PATH_FILTER = new PathFilter() {
  public boolean accept(Path p) {
    String name = p.getName();
    boolean isHudiMeta = name.startsWith(".hoodie");
    boolean isHudiLog = false;
    Pattern LOG_FILE_PATTERN = Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
    Matcher matcher = LOG_FILE_PATTERN.matcher(name);
    if (matcher.find()) {
      isHudiLog = true;
    }
    // Hudi metadata and log files must stay visible to Hive even though they are dotfiles.
    boolean isHudiFile = isHudiLog || isHudiMeta;
    return (!name.startsWith("_") && !name.startsWith(".")) || isHudiFile;
  }
};
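To illustrate what the modified filter accepts, here is a small Python sketch that mirrors the same logic and regular expression. The sample file names are assumptions modeled on Hudi's naming conventions, not taken from a real table:

```python
import re

# The same pattern as in the modified FileUtils, with Java string escapes removed.
LOG_FILE_PATTERN = re.compile(r"\.(.*)_(.*)\.(.*)\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?")

def accept(name: str) -> bool:
    """Mirror of the modified accept(): keep Hudi meta/log files, hide other hidden files."""
    is_hudi_meta = name.startswith(".hoodie")
    is_hudi_log = LOG_FILE_PATTERN.search(name) is not None
    is_hudi_file = is_hudi_log or is_hudi_meta
    return (not name.startswith("_") and not name.startswith(".")) or is_hudi_file

print(accept(".f1_20210316152718.log.1_1-0-1"))  # Hudi log file: True
print(accept(".hoodie"))                          # Hudi metadata: True
print(accept("part-0000.parquet"))                # ordinary data file: True
print(accept("_SUCCESS"))                         # hidden marker file: False
```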


Recompile Hive, then copy the newly compiled hive-common-xxx.jar and hive-exec-xxx.jar into the lib directory of the Hive server to replace the old ones, taking care that the permissions and names match the original jar packages.
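The rebuild-and-replace step might look like the following sketch. The source path, the 3.1.1 version string, and the /opt/hive/lib directory are assumptions; -pl common,ql restricts the build to the modules that produce hive-common and hive-exec:

```
cd /path/to/hive-source                      # assumption: your Hive 3.1.1 source checkout
mvn clean install -DskipTests -pl common,ql -am
# Replace the jars on the hiveserver node, keeping the original names and permissions.
cp common/target/hive-common-3.1.1.jar /opt/hive/lib/
cp ql/target/hive-exec-3.1.1.jar /opt/hive/lib/
```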

Finally, restart hive-server.
