Diversified Exploration and Practice of Apache Flink at bilibili
InfoQ 2021-06-04 10:37:58
{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" This paper is written by bilibili Zheng Zhisheng, director of big data real-time platform, shares , This sharing core explains the implementation of trillion level transmission and distribution architecture , as well as AI How domains are based on Flink Build a complete set of preprocessing real-time Pipeline. This sharing mainly focuses on the following four aspects :","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" One 、B Station real-time past and present life ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Two 、Flink On Yarn The incremental pipeline scheme of ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 3、 ... and 、Flink and AI Some engineering practice of the direction ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Four 、 Future development and thinking ","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GitHub Address ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/apache/flink","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/apache/flink","attrs":{}}]},{"type":"text","text":" Welcome to Flink Like to send star~","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":" One 、B Station real-time past and present life ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Ecological scene radiation ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Speaking of the future of real-time computing , The key word is the effectiveness of the data . First of all, from the ecology of the whole big data development , Take a look at its core scene radiation : In the early days of big data development , The core is the scenario of off-line computing with day oriented granularity . At that time, most of the data effectiveness was based on calculation, in days , It's more about balancing time and cost .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" With the application of data , The popularization and improvement of data analysis and data warehouse , More and more people put forward higher requirements for the effectiveness of data . such as , When you need to do some real-time data recommendation , The effectiveness of data will determine its value . 
under these circumstances , The whole scene of real-time computing was born .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" But in the actual operation process , There are also many scenes , In fact, there is no very high real-time requirement for data , In this case, there must be data from milliseconds , Some new scenes of seconds or days , The real-time scene data are more incremental computing scenarios with minute granularity . For offline computing , It's more cost focused ; For real-time computing , It pays more attention to value and effectiveness ; And for incremental computing , It's more about balancing costs , And the combined value and time .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0e/0e17a31aefb7fe722c5fb6df4b4a10f4.jpeg","alt":" picture 2","title":" picture 2","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. B The timeliness of the station ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In three dimensions ,B What is the division of stations ? about B For station , There are 75% Our data is supported by offline computing , And then there is 20% Our scene is through real-time computing , 5% It's through incremental calculation .","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" For real-time computing scenarios , It is mainly applied to the whole real-time machine learning 、 Real-time recommendation 、 Ad search 、 Data applications 、 Real time channel analysis 、 report form 、olap、 Monitoring etc. ;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" For offline computing , There is a wide range of data , It is mainly based on data warehouse ;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" For incremental computing , It's only this year that we've started some new scenes , for instance binlog The incremental Upsert scene .","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8c/8c602abaeaf2ed79c67b5d964e356017.jpeg","alt":" picture 3","title":" picture 3","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
ETL Poor timeliness ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" On the issue of effectiveness , In fact, I met a lot of pain points in the early stage , The core focuses on three aspects :","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" First of all , The transmission pipeline lacks computing power . Early plans , Basically, data should be sent to ODS ,DW The layer is the day after midnight to scan all the previous day ODS Layer data , in other words , There is no way to pre clean the whole data ;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" second , Resources with a large number of jobs are concentrated after midnight , There will be a lot of pressure on the whole resource arrangement ;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Third 、 Real time and offline gap It's hard to satisfy , Because for most of the data , The cost of pure real-time is too high , The effectiveness of pure offline is too poor . meanwhile ,MySQL The warehousing time of data is not enough . for instance , like B The barrage data of the station , Its size is very exaggerated , This kind of business table synchronization often takes more than ten hours , And it's very unstable .","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/47/4797e30726e31f402cd4be3525857ce4.jpeg","alt":" picture 4","title":" picture 4","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. AI Real time engineering is complex ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Besides the issue of effectiveness In the early days, I met AI Real time engineering is a complex problem :","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" First of all , It's a problem of computational efficiency of the whole feature engineering . The same real-time feature computing scenario , We also need to do data backtracking in offline scenarios , Computational logic will be developed repeatedly ;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" second , The whole real-time link is quite long . 
A complete real-time recommendation link , covers N It's a real-time and M It is composed of more than ten offline jobs , Sometimes we have problems , The cost of operation and control of the whole link is very high ;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Third 、 With AI The increase in personnel , The input of algorithmic personnel , Experimental iterations are hard to scale horizontally .","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fd/fd2651e74f4eee4c9a2c5e4e989aa1e7.jpeg","alt":" picture 5","title":" picture 5","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Flink Do the ecological practice ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In the context of these key pain points , We focus on Flink Do the ecological practice , The core includes the application of the whole real-time data warehouse and the whole incremental data warehouse ETL The Conduit , And facing AI Some of the scenes of machine learning . This sharing will focus more on incremental pipeline and AI Add Flink In the direction of . The figure below shows the overall scale , at present , The whole volume of transmission and computation , On the trillions scale, there are 30000+ Count the kernels ,1000+ job The number and 100 Multiple users .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c4/c48316af684c8b03c5567df18cecccd6.jpeg","alt":" picture 11","title":" picture 11","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":" Two 、Flink On Yarn  The incremental pipeline scheme of ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
## 2. The Incremental Pipeline Scheme of Flink on YARN

### 1. Early architecture

Let's first look at the early architecture of the pipeline. As the figure below shows, data was mainly consumed from Kafka by Flume and landed on HDFS. Flume used its transaction mechanism to guarantee consistency from Source to Channel and then to Sink. After the data landed on HDFS, a downstream scheduler scanned the directory for tmp files to determine whether the data was ready, and then scheduled and pulled up the downstream offline ETL jobs.

![picture 7](https://static001.geekbang.org/infoq/2b/2b095f30ec59904c6ea29f8f305bf099.jpeg)

### 2. Pain points

We ran into many pain points at this stage:

- The first key issue is data quality. We first used MemoryChannel, which loses data; we then tried FileChannel, whose performance could not meet the requirements. In addition, when HDFS was unstable, Flume's transaction mechanism rolled data back into the Channel, which to some extent produced duplicates. When HDFS was extremely unstable, the duplication rate could reach the percent level. On top of that, we used LZO row storage with delimiter-separated records; such delimiters impose only weak schema constraints and do not support nested formats.
- The second point is data timeliness: minute-level queries could not be provided, because Flume does not have Flink's checkpoint-based file cutting mechanism and relies mostly on an idle mechanism to control when files are closed.
- The third point is downstream ETL linkage. As mentioned above, we relied on scanning whether the tmp directory was ready, so the scheduler issued a large number of hadoop list API calls to the NameNode, which put a lot of pressure on the NameNode.

![picture 6](https://static001.geekbang.org/infoq/dc/dc36a937cf047b3c96dafdaeec11c5cc.jpeg)

### 3. Stability pain points

There were also many problems with stability:

- First, Flume is stateless: after a node exception or restart, tmp files cannot be closed properly;
- Second, in the early days Flume was not attached to the big-data environment but deployed on physical machines, which made resource scaling hard to control and drove up cost;
- Third, there were communication problems between Flume and HDFS. For example, when writing to HDFS blocked, the blockage on one node back-pressured into the Channel, the Source stopped consuming from Kafka and stopped advancing offsets, which in some cases triggered Kafka rebalances; in the end, global offsets across partitions stopped advancing and data piled up.

![Image1](https://static001.geekbang.org/infoq/95/95d1ce770847d2b7106d0749266aeac7.jpeg)

### 4. DAG view of the trillion-level incremental pipeline

Under the pain points above, the core solution is to build a trillion-level incremental pipeline based on Flink. The figure below shows the runtime DAG view.

First, under the Flink architecture, the Kafka source eliminates the rebalance avalanche: even if one parallel instance in the DAG is blocked when writing to HDFS, it does not block consumption of all Kafka partitions globally. Second, the essence of the scheme is to build extensible nodes out of Transform modules:

- The first layer of nodes is the Parser, which handles decompression, deserialization and other parsing work;
- The second layer is a customizable ETL module offered to users, so that custom data cleaning can happen inside the pipeline;
- The third layer is the Exporter module, which supports exporting data to different storage media: when writing to HDFS it exports parquet, and when writing to Kafka it exports the pb format. In addition, a ConfigBroadcast is introduced into the DAG to solve real-time updating and hot loading of pipeline metadata. The whole link checkpoints once per minute and appends the incremental data, which provides minute-level queries.

A minimal sketch of this three-layer assembly is shown after the figure below.

![picture 2](https://static001.geekbang.org/infoq/cd/cd302cb4688de2f35a0c73c7218c04b0.jpeg)
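The following is a minimal sketch of how such a three-layer pipeline (Parser, user ETL, Exporter) could be assembled with Flink's DataStream API. The class names, topic, and addresses are illustrative assumptions, not bilibili's actual implementation.

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class IncrementalPipelineSketch {

    /** Parsed record produced by the Parser layer (illustrative POJO). */
    public static class PBEvent {
        public String table;
        public byte[] payload;
    }

    /** Layer 1: Parser, i.e. decompression / deserialization. */
    public static class ParserFunction implements MapFunction<String, PBEvent> {
        @Override
        public PBEvent map(String raw) {
            PBEvent e = new PBEvent();
            e.table = raw.substring(0, Math.min(8, raw.length())); // placeholder parsing
            e.payload = raw.getBytes();
            return e;
        }
    }

    /** Layer 2: user-defined ETL cleaning plugged into the pipeline. */
    public static class UserEtlFunction implements MapFunction<PBEvent, PBEvent> {
        @Override
        public PBEvent map(PBEvent event) {
            return event; // custom cleaning logic would go here
        }
    }

    /** Layer 3: Exporter, e.g. parquet to HDFS or pb to Kafka. */
    public static class ExporterSink extends RichSinkFunction<PBEvent> {
        @Override
        public void invoke(PBEvent value, Context context) {
            // write value.payload to the target storage for value.table
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // one checkpoint per minute gives minute-level visibility

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical address
        props.setProperty("group.id", "incremental-pipeline");

        env.addSource(new FlinkKafkaConsumer<>("raw-events", new SimpleStringSchema(), props))
           .map(new ParserFunction())
           .map(new UserEtlFunction())
           .addSink(new ExporterSink());

        env.execute("incremental-pipeline-sketch");
    }
}
```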
### 5. Overall view of the trillion-level incremental pipeline

This is the overall structure of Flink on YARN. The pipeline view is divided by BU (business unit). Each Kafka topic represents the distribution of a certain kind of data terminal, and Flink jobs take care of writing each terminal type. As the view also shows, for binlog data the full pipeline is assembled end to end, and one pipeline can be operated across multiple nodes.

![picture 1](https://static001.geekbang.org/infoq/c8/c8962506ff7a3850cb55c1ed666283f2.jpeg)

### 6. Technical highlights

Next, let's look at the core technical highlights of the architecture. The first three are real-time, functional features; the last three are optimizations at the non-functional level.

- Data model: the schema converges on parquet, using a Protobuf-to-parquet mapping;
- Partition notification: one pipeline actually processes multiple streams, so the core is a partition-ready notification mechanism for multiple streams;
- CDC pipeline: using binlog and Hudi to solve the upsert problem;
- Small files: merging files through the runtime DAG topology;
- HDFS communication: optimizations of several key problems at trillion-record scale;
- Finally, some optimizations for partition fault tolerance.

![picture 3](https://static001.geekbang.org/infoq/92/925ff99b31f012c778c939c2b99ffe55.jpeg)

#### 6.1 Data model

Business development originally reported records by concatenating strings. Later, data was organized through the definition and management of models: the platform provides an entry where users register the schema of each stream and each table. The schema generates a Protobuf file, and users can download the HDFS model file corresponding to that Protobuf on the platform, so that client-side development is fully constrained by a strong schema derived from pb.

Looking at the runtime process: the Kafka source consumes each RawEvent record coming from upstream; a RawEvent contains a PBEvent object, where a PBEvent is a Protobuf record. The data flows from the Source into the Parser module and is parsed into PBEvents. The schema model that the user entered on the platform is stored in the OSS object store, and the Exporter module dynamically loads model changes. The pb file is then used to generate, via reflection, the concrete event object, which is finally mapped into the parquet format. Here we mainly did a lot of caching of the reflection calls, which improved the performance of dynamic pb parsing by a factor of six. Finally, the data lands on HDFS in parquet format. A minimal sketch of such descriptor caching follows the figure below.

![picture 5](https://static001.geekbang.org/infoq/45/45cd02598387abc75d6c8330d5539d76.jpeg)
#### 6.2 Partition notification optimization

As mentioned earlier, one pipeline handles hundreds of streams. In the early Flume architecture, each Flume node could hardly sense the progress of its own processing, and Flume had no way to track the overall progress either. With Flink, this can be solved through watermarks.

The Source generates a watermark based on event time; the watermark is propagated through every layer down to the Sink, and finally a Committer module aggregates the watermark progress of all instances in a single-threaded way. When it finds that the global watermark has advanced into the next hour's partition, it sends a message to the Hive MetaStore, or writes it to Kafka, to announce that the previous hour's partition data is ready, so that downstream scheduling can pull up jobs faster in a message-driven way. A hedged sketch of such a committer follows the figure below.

![picture 4](https://static001.geekbang.org/infoq/58/58c198f9b2c10f0bc4a52cc260d0603f.jpeg)
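The sketch below shows the core of such a committer as a process function run with parallelism 1: it tracks the global watermark and, each time the watermark crosses an hour boundary, emits a partition-ready notification. The notification payload and the Kafka topic name are illustrative assumptions; a call into the Hive MetaStore would replace the producer call.

```java
import java.util.Properties;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

/** Runs with parallelism 1; announces an hourly partition once the global watermark passes it. */
public class PartitionCommitter extends ProcessFunction<String, String> {

    private static final long ONE_HOUR_MS = 60 * 60 * 1000L;

    private transient KafkaProducer<String, String> producer;
    private long lastCommittedHour = Long.MIN_VALUE;

    @Override
    public void open(Configuration parameters) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");                 // hypothetical address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        producer = new KafkaProducer<>(props);
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        long watermark = ctx.timerService().currentWatermark();
        if (watermark < 0) {            // no watermark has been emitted yet
            out.collect(value);
            return;
        }
        long readyHour = (watermark / ONE_HOUR_MS) * ONE_HOUR_MS - ONE_HOUR_MS;
        if (readyHour > lastCommittedHour) {
            // the previous hourly partition is complete: notify downstream (Kafka here; Hive MetaStore also works)
            producer.send(new ProducerRecord<>("partition-ready", "log_hourly", String.valueOf(readyHour)));
            lastCommittedHour = readyHour;
        }
        out.collect(value);
    }

    @Override
    public void close() {
        if (producer != null) {
            producer.close();
        }
    }
}
```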
#### 6.3 Optimizations on the CDC pipeline

The right side of the figure is the complete CDC pipeline link. To achieve a complete mapping of MySQL data into Hive, both the batch and the streaming problems have to be solved.

First, DataX synchronizes the full MySQL data onto HDFS, and a Spark job initializes this data into the initial Hudi snapshot. Canal then pulls the MySQL binlog into a Kafka topic, and a Flink job merges the initial snapshot with the incremental data to apply incremental updates into the final Hudi table.

The whole link has to guarantee that data is neither lost nor duplicated. The focus is on the Canal-to-Kafka step, where a transactional write is enabled so that data is not lost and not duplicated on its way into the Kafka topic. In addition, data may still be duplicated or lost upstream of the transmission, so each record carries a globally unique id plus a millisecond timestamp. In the streaming job, records are deduplicated by the global id and ordered by the millisecond timestamp, which guarantees that the data is applied to Hudi in an orderly way; a sketch of this step follows the figure below.

In addition, a Trace system backed by ClickHouse counts the records entering and leaving each node, so that data completeness can be compared precisely.

![picture 12](https://static001.geekbang.org/infoq/12/12f306f3b70411874f4e65679594ed18.jpeg)
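A minimal sketch of the dedup-and-order step, assuming each binlog record carries a globally unique id and a millisecond timestamp (the field names `uid` and `ts` are illustrative). It keeps the ids already seen per business key in keyed state and drops late or duplicate updates before they reach the Hudi writer; the real pipeline would also expire this state over time.

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** One change event from the binlog (illustrative field names). */
class BinlogEvent {
    public String primaryKey; // key of the business row
    public String uid;        // globally unique id attached upstream
    public long ts;           // millisecond timestamp attached upstream
    public String payload;
}

/** Keyed by primaryKey: drops duplicate uids and out-of-order updates before the Hudi upsert. */
public class DedupAndOrderFunction extends KeyedProcessFunction<String, BinlogEvent, BinlogEvent> {

    private transient MapState<String, Boolean> seenIds;
    private transient ValueState<Long> lastAppliedTs;

    @Override
    public void open(Configuration parameters) {
        seenIds = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("seen-ids", String.class, Boolean.class));
        lastAppliedTs = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-ts", Long.class));
    }

    @Override
    public void processElement(BinlogEvent event, Context ctx, Collector<BinlogEvent> out) throws Exception {
        if (seenIds.contains(event.uid)) {
            return; // exact duplicate produced by a retry somewhere upstream
        }
        Long last = lastAppliedTs.value();
        if (last != null && event.ts <= last) {
            return; // older than what has already been applied: keep updates ordered
        }
        seenIds.put(event.uid, Boolean.TRUE);  // real pipeline: expire these entries with state TTL
        lastAppliedTs.update(event.ts);
        out.collect(event);                    // forward to the Hudi upsert sink
    }
}
```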
.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Core optimization focuses on three aspects :","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" First of all , Reduced file fragmentation , That is to say close Frequency of . stay Snapshot Stage , Don't go to close Close file , And more through the way of file continuation . such , Initializing state The stage of , You need to make documents Truncate To do it Recovery recovery .","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" second , It's asynchronous close Improvement , Can be said to be close This action will not block the processing of the whole link , in the light of Invoke and Snapshot Of close, Will manage the state to state among , By initializing state To recover files .","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Third , For multiple streams ,Snapshot Parallel processing is also done , Every time 5 Minutes of Checkpoint, Multiple streams are actually multiple streams bucket, It will be processed serially through loops , Then we can transform it through multithreading , You can reduce Checkpoint timeout Happen .","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9c/9c6072110b7e4a8d6bf034321f236ac2.jpeg","alt":" picture 7","title":" picture 7","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"6.6 Some optimizations of partition fault tolerance ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In fact, in the case of multiple flows in the pipeline , Some streams are not continuous every hour .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" This kind of situation will bring partition , its Watermark There's no way to move forward , Causes the problem of empty partitions . So we are in the process of running the pipeline , introduce PartitionRecover modular , It will be based on Watermark To advance the zoning notification . For some streams Watermark, If in ideltimeout Without updating ,Recover Module to add partition . 
#### 6.6 Optimizations of partition fault tolerance

When a pipeline carries multiple streams, some streams are not continuous in every hour. In that case a partition's watermark cannot advance, which leads to the empty-partition problem. So we introduced a PartitionRecover module into the pipeline, which advances the partition notification based on the watermark: if some stream's watermark has not been updated within the idle timeout, the Recover module advances the partition for it. Near the end of each partition it adds a delay time and scans the watermarks of all streams, acting as a fallback. A simplified sketch of this fallback is shown below the figure.

During transmission, when a Flink job restarts, a wave of zombie files appears. We clean up and delete the zombie files via the commit node of the DAG, before the partition notification of the whole partition is sent. These are all optimizations at the non-functional level.

![picture 8](https://static001.geekbang.org/infoq/68/684a7124f4ece7c5f4dcd684e5728c9f.jpeg)
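A simplified sketch of the idle-stream fallback, written as a keyed process function with processing-time timers. The idle timeout, the one-hour partition size, and the notification string are illustrative assumptions.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Keyed by stream name: if a stream is idle past the timeout, force its hourly partition forward. */
public class PartitionRecoverFunction extends KeyedProcessFunction<String, Long, String> {

    private static final long IDLE_TIMEOUT_MS = 10 * 60 * 1000L;   // illustrative idle timeout
    private static final long ONE_HOUR_MS = 60 * 60 * 1000L;

    private transient ValueState<Long> lastActivity;   // processing time of the last record of this stream
    private transient ValueState<Long> lastEventTime;  // event-time high-water mark of this stream

    @Override
    public void open(Configuration parameters) {
        lastActivity = getRuntimeContext().getState(new ValueStateDescriptor<>("last-activity", Long.class));
        lastEventTime = getRuntimeContext().getState(new ValueStateDescriptor<>("last-event-time", Long.class));
    }

    @Override
    public void processElement(Long eventTime, Context ctx, Collector<String> out) throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        lastActivity.update(now);
        lastEventTime.update(eventTime);
        // re-arm the idle timer every time the stream makes progress
        ctx.timerService().registerProcessingTimeTimer(now + IDLE_TIMEOUT_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        Long active = lastActivity.value();
        Long evt = lastEventTime.value();
        if (active != null && evt != null && now - active >= IDLE_TIMEOUT_MS) {
            // stream has gone idle: declare its next hourly partition as complete so the global notice can advance
            long nextPartition = (evt / ONE_HOUR_MS + 1) * ONE_HOUR_MS;
            out.collect(ctx.getCurrentKey() + " partition-ready " + nextPartition);
            lastEventTime.update(nextPartition);
            ctx.timerService().registerProcessingTimeTimer(now + IDLE_TIMEOUT_MS); // keep advancing while idle
        }
    }
}
```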
## 3. Engineering Practice of Flink in the AI Direction

### 1. Timeline of the architecture evolution

The figure below is the complete timeline of the real-time architecture in the AI direction.

- As early as 2018, much of the algorithm engineers' experimental development was workshop style. Each engineer picked the language they were familiar with, such as Python, PHP or C++, so different experimental projects were developed in different languages. Maintenance was very expensive and prone to failure;
- In the first half of 2019, engineering support for the algorithms was mainly provided through Flink jar-package jobs; in that first half year the focus was on stability and common functionality;
- In the second half of 2019, the self-developed BSQL greatly lowered the threshold for model training and solved the streaming of labels and instances, improving the efficiency of experiment iteration;
- In the first half of 2020, the work revolved around feature computation: stream-batch computing and the efficiency of feature engineering;
- In the second half of 2020, the focus moved to the flow of experiments and the introduction of AIFlow, making it convenient to build stream-batch DAGs.

![picture 9](https://static001.geekbang.org/infoq/be/beb6b14704dde7c4f86e6a2d70e5af4b.jpeg)

### 2. Review of the AI engineering architecture

Reviewing the AI engineering as a whole, the early architecture, which reflects the state at the beginning of 2019, essentially supported model training through single tasks: computing nodes written in a mix of languages. After the iteration through 2019, the near-line training was replaced with BSQL for development and iteration.

![picture 11](https://static001.geekbang.org/infoq/91/91876709173949a80ac4f31d070117c5.jpeg)

### 3. Current pain points

At the end of 2019, some new problems emerged, concentrated in the functional and non-functional dimensions.

- At the functional level:
  - First, the link from labels to producing the instance stream, to model training, to online serving and prediction, and finally to the real experimental results is very long and complex;
  - Second, integrating real-time features, offline features and stream-batch processing involves a large number of jobs, and the link is complex. Features have to be computed both for experiments and for the online system, and inconsistent results lead to problems with the final effect. Besides, features are hard to locate anywhere and cannot be traced.

![picture 12](https://static001.geekbang.org/infoq/3f/3f09e7cc2848b239f4225c79c7a6b06b.jpeg)

- At the non-functional level, algorithm engineers often run into questions such as: what is a checkpoint, should it be enabled, and how should it be configured? In addition, online problems are not easy to troubleshoot, because the link is very long.
- The third point is that a complete experimental pipeline involves a lot of resources, but the algorithm side does not know what these resources are or how much is needed. All of these problems cause great confusion for the algorithm side.

### 4. Boiling the pain points down

In the final analysis, the pain points concentrate in three areas:

- The first is consistency. From data preprocessing to model training and then to prediction, each stage is disconnected, including inconsistencies in the data as well as inconsistencies in the computing logic;
- Second, experiment iteration is very slow. For a complete experimental link, an algorithm engineer has to master a great deal, and the materials behind an experiment cannot be shared; some features, for example, have to be developed repeatedly for every experiment;
- Third, the cost of operations and maintenance is relatively high.

A complete experimental link is composed of a real-time project and an offline project, and online problems are hard to troubleshoot.

![picture 13](https://static001.geekbang.org/infoq/49/498b5590957918b006cd305d148359e0.jpeg)
### 5. The embryonic form of real-time AI engineering

Facing these pain points, in 2020 the focus was on building the prototype of real-time AI engineering. The core is to break through in the following three aspects:

- The first is BSQL capabilities: for the algorithm side, the hope is to develop against SQL so as to reduce engineering investment;
- The second is feature engineering: solving the core problems of feature computation to support the feature requirements;
- The third is collaboration on the whole experiment: the ultimate purpose of the algorithm work is experimentation, so the hope is to build end-to-end experiment collaboration and, in the end, give the algorithm side a "one-click experiment".

![picture 14](https://static001.geekbang.org/infoq/fb/fb4a826c7e82090ea76c03c68942ee1e.jpeg)

### 6. Feature engineering: difficulties

We encountered some difficulties in feature engineering:

- The first is real-time feature computation: since the result has to be served to the online prediction service, the requirements on latency and stability are very high;
- The second is keeping real-time and offline computing logic consistent. A real-time feature often needs to backtrack 30 to 60 days of offline data, so the question is how the real-time feature computing logic can be reused in offline feature computation;
- The third is that the integration of offline features is difficult to get through. The computing logic of real-time features involves streaming concepts such as windows and timing, but offline features do not have these semantics.

![picture 15](https://static001.geekbang.org/infoq/4b/4b816ef00dc5f969b0eaec902be7fdc1.jpeg)

### 7. Real-time features

Here is how we handle real-time features; the right side of the figure shows some of the most typical scenarios. For example, counting in real time how many times a user played the videos of each UP主 (uploader) in the last minute, 6 hours, 12 hours and 24 hours. For this scenario there are two points:

- First, it needs sliding windows to compute over the user's recent history. During the sliding computation, the data also needs to be joined against the UP主's basic-information dimension tables to get the UP主's videos and count their plays. This, in the end, runs into two big pains:
  - With Flink's native sliding window and minute-level slides, too many windows are produced and the performance loss is large;
  - Fine-grained windows also create too many timers, and cleanup efficiency is poor.
- The second is the dimension-table query: many keys have to be looked up in HBase, each with multiple corresponding values, which requires support for concurrent array queries.

Under these two pain points, the sliding windows are rewritten into a Group By pattern plus an aggregate UDF, in which the 1-hour, 6-hour, 12-hour and 24-hour windows of data are all stored in RocksDB. With this UDF pattern, the data trigger mechanism can achieve record-level triggering based on Group By, so the semantics and timeliness are greatly improved. Inside the AGG UDF, RocksDB acts as the state and the data life cycle is maintained in the UDF. We also extended the SQL implementation of array-level dimension-table queries. The final effect is that, in the real-time feature direction, various computing scenarios can be supported with large windows (a hedged sketch follows the figure below).

![picture 16](https://static001.geekbang.org/infoq/e0/e04108d0c9a8523a2001a971c60a0f91.jpeg)
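A minimal sketch of the idea of replacing sliding windows with GROUP BY plus an aggregate UDF. The function below keeps a list of event timestamps in its accumulator and returns the count of the trailing 24 hours on every record; bilibili's actual UDF stores this state in RocksDB and maintains several window lengths at once, which this simplified version does not do.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.flink.table.functions.AggregateFunction;

/** Counts events in the trailing 24 hours; updated on every record (record-level trigger via GROUP BY). */
public class Recent24hCount extends AggregateFunction<Long, Recent24hCount.Acc> {

    public static class Acc {
        public List<Long> eventTimes = new ArrayList<>(); // bilibili's version keeps this in RocksDB state
    }

    private static final long WINDOW_MS = 24 * 60 * 60 * 1000L;

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    /** Called once per input row of the GROUP BY key. */
    public void accumulate(Acc acc, Long eventTimeMs) {
        acc.eventTimes.add(eventTimeMs);
        // evict everything older than 24 hours relative to the newest event (life cycle maintained in the UDF)
        long threshold = eventTimeMs - WINDOW_MS;
        Iterator<Long> it = acc.eventTimes.iterator();
        while (it.hasNext()) {
            if (it.next() < threshold) {
                it.remove();
            }
        }
    }

    @Override
    public Long getValue(Acc acc) {
        return (long) acc.eventTimes.size();
    }
}
```

Registered with something like `tableEnv.createTemporarySystemFunction("recent_24h_cnt", Recent24hCount.class)`, it could then be used as `SELECT uid, up_id, recent_24h_cnt(event_ts) FROM play_events GROUP BY uid, up_id` (table and field names are illustrative).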
9. Optimizations

9.1 Offline - ordered partitions

The ordered-partition scheme is mainly implemented by modifying how the data lands on HDFS. Before landing on HDFS, the data goes through the transport pipeline, which consumes it from Kafka. After the Flink job pulls the data from Kafka, it extracts watermarks from the event time; each parallel instance of the Kafka Source reports its watermark to the GlobalWatermark module in the JobManager, and GlobalAgg aggregates the watermark progress of every parallel instance to compute the global watermark. Based on the global watermark, it can work out which parallel instances are consuming too far ahead, and send control messages back down to the Kafka Sources through GlobalAgg, so that the instances running too fast slow down. As a result, in the HDFS Sink module, the records received within the same time slice are essentially ordered by event time, and each file landed on HDFS carries its partition and time-slice range in the file name. The HDFS partition directories therefore end up as an ordered layout of data partitions.

[Picture 18: https://static001.geekbang.org/infoq/e1/e18d9d80c926edd977540a8ad0ffa998.jpeg]

9.2 Offline - incremental partition consumption

Once the data on HDFS is incrementally ordered, an HDFStreamingSource was implemented. It creates Fetcher partitions over the files, with one Fetcher thread per file; each Fetcher thread tracks the offset of its file, i.e. the cursor of its processing progress, and updates it into state as part of the checkpoint process. This gives ordered, incremental advancement of the overall file consumption; a rough sketch of the checkpointed-offset idea is shown below.
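The sketch is not the actual HDFStreamingSource: a single reader loop stands in for the per-file Fetcher threads, the file reading itself is stubbed out, and the class name is invented. It only illustrates keeping a map of file path to cursor and snapshotting it into operator state on every checkpoint.

```java
// Minimal checkpointed-offset source sketch: per-file cursors are kept in
// memory, snapshotted into operator state, and restored on recovery.
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class HdfsOffsetSource extends RichSourceFunction<String>
        implements CheckpointedFunction {

    private volatile boolean running = true;
    // file path -> byte offset, the per-file consumption cursor
    private final Map<String, Long> offsets = new HashMap<>();
    private transient ListState<Tuple2<String, Long>> offsetState;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // For each discovered file: read from offsets.get(path) onward,
            // emit the records, and advance the cursor under the checkpoint
            // lock so offsets and emitted data stay consistent.
            synchronized (ctx.getCheckpointLock()) {
                // ctx.collect(record); offsets.put(path, newOffset);
            }
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        for (Map.Entry<String, Long> e : offsets.entrySet()) {
            offsetState.add(Tuple2.of(e.getKey(), e.getValue()));
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String, Long>> descriptor =
                new ListStateDescriptor<>("file-offsets",
                        Types.TUPLE(Types.STRING, Types.LONG));
        offsetState = context.getOperatorStateStore().getListState(descriptor);
        for (Tuple2<String, Long> e : offsetState.get()) {
            offsets.put(e.f0, e.f1);
        }
    }
}
```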
When replaying historical data, the offline job also has to bring itself to a stop once it catches up. This is done by introducing a partition-end marker into the FileFetcher module: every thread watches its own partitions so that it can sense when a partition has ended, and the end-of-partition states are aggregated in a cancellationManager and reported further up to the JobManager to update the global partition progress. Once all global partitions have reached their end cursor, the whole Flink job is cancelled and shut down.

[Picture 19: https://static001.geekbang.org/infoq/e2/e2ea9390c05a52b0851403812db22aa6.jpeg]

9.3 Offline - snapshot dimension tables

As mentioned, the offline data all lives in Hive. A Hive table on HDFS carries a lot of fields, but an offline feature only needs a small subset of them, so the Hive table is first pruned offline, cleaning an ODS table into a DW table. The DW table is then consumed by a Flink job that contains a reload scheduler: based on the current watermark of the data, it periodically pulls the table information of the Hive partitions ahead of the watermark, downloads the data from the corresponding Hive directories on HDFS, and reloads it in memory into RocksDB files. RocksDB then acts as the component serving the dimension-table KV queries.

The component runs multiple RocksDB build processes, driven mainly by the event time of the data flow: when the event time is found to be approaching the end of the current hourly partition, the next hour's RocksDB partition is proactively built through lazy loading, and the reads are then switched over to the new RocksDB instance.

[Picture 1: https://static001.geekbang.org/infoq/0e/0e8694f2114aa774dcf4f7c20c0704fe.jpeg]

A much-simplified sketch of this hourly switch-over follows.
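In the sketch, one RocksDB instance is kept per hour partition, the next hour is built lazily in the background, and reads switch at the hour boundary. The Hive download and parsing step is stubbed out, and every class, method and path name is invented; this is an illustration of the idea, not the component itself.

```java
// Hourly snapshot dimension table sketch: current hour served from one
// RocksDB instance, next hour pre-built lazily, switch-over at the boundary.
import java.util.concurrent.CompletableFuture;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class HourlySnapshotDimTable implements AutoCloseable {

    static { RocksDB.loadLibrary(); }

    private static final long HOUR_MS = 3_600_000L;

    private long currentHour = -1L;
    private RocksDB current;
    private CompletableFuture<RocksDB> next;   // lazily built next-hour snapshot

    /** Look up a dimension value for a record carrying the given event time. */
    public byte[] lookup(long eventTime, byte[] key) throws Exception {
        long hour = eventTime / HOUR_MS;
        if (hour != currentHour) {
            switchTo(hour);
        } else if (next == null && eventTime % HOUR_MS > HOUR_MS - 10 * 60_000L) {
            // Event time is close to the end of the hour: pre-build the next one.
            long nextHour = hour + 1;
            next = CompletableFuture.supplyAsync(() -> buildSnapshot(nextHour));
        }
        return current.get(key);
    }

    private void switchTo(long hour) throws Exception {
        RocksDB fresh = (next != null && hour == currentHour + 1)
                ? next.get()
                : buildSnapshot(hour);
        if (current != null) {
            current.close();
        }
        current = fresh;
        currentHour = hour;
        next = null;
    }

    /** Download the Hive partition for this hour and bulk-load it into RocksDB. */
    private RocksDB buildSnapshot(long hour) {
        try {
            Options options = new Options().setCreateIfMissing(true);
            RocksDB db = RocksDB.open(options, "/tmp/dim-snapshot-" + hour);
            // for each (key, value) row in the downloaded partition files:
            //     db.put(keyBytes, valueBytes);
            return db;
        } catch (RocksDBException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() {
        if (current != null) {
            current.close();
        }
    }
}
```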
10. Stream-batch unification for experiments

With the three optimizations above (ordered incremental partitions, Kafka-style partitioned Fetcher consumption, and snapshot dimension tables), real-time features and offline features finally share a single set of SQL, unifying the stream and batch computation of features.

Now let's look at the complete stream-batch integrated link of an experiment. In the figure, the topmost layer is the complete offline computation process and the second is the near-line process; the semantics of the offline process and the near-line process are exactly the same, and both use Flink to provide the SQL computation.

Looking at the near-line side, the Label Join takes a Kafka click stream and a Kafka impression stream; for the offline computing link it uses an HDFS click directory and an HDFS impression directory instead. Feature data processing works the same way: in real time it uses Kafka playback data plus some video submission (稿件) data from HBase, while offline it uses the submission data and playback data from Hive. Besides unifying stream and batch across the offline and near-line links, the real-time metrics produced by the near-line link are also aggregated into an OLAP engine, and Superset provides visualization of these real-time metrics. As the figure shows, the complete stream-batch integrated computing link involves a large number of rather complex compute nodes.

[Picture 2: https://static001.geekbang.org/infoq/a9/a9fce6495111fdbbd9845bea3f33e0f0.jpeg]

11. Experiment collaboration - challenges

The challenge at the next stage is more about experiment collaboration. The figure below is an abstraction of the link after simplification. Inside the three dashed boxes are the offline link and two real-time links; together these three complete links form the stream and batch jobs, which is essentially the most basic workflow. What is needed is a complete abstraction of the workflow, including a driving mechanism for stream and batch events. Since algorithm engineers in the AI domain prefer to define the complete flow in Python, the inputs, outputs and the computation itself should also be templated, which makes it easy to clone whole experiments.

[Picture 3: https://static001.geekbang.org/infoq/4e/4eccfd6abd1fa637fae5aff68254a910.jpeg]
12. Introducing AIFlow

In the second half of the year, the workflow effort was carried out mainly in cooperation with the community, and the complete AIFlow solution was introduced.

On the right is the full-link DAG view of AIFlow. There is no restriction on the node types it supports: a node can be a streaming node or an offline node. Moreover, the edges between nodes support both data-driven and event-driven communication. The main advantage of introducing AIFlow is that it provides Python-based semantics for conveniently defining the complete workflow, together with scheduling for the whole workflow.

On the edges between nodes, compared with existing workflow solutions in the industry, it additionally supports an event-driven mechanism. The benefit is that, between two Flink jobs, the watermark-based data-partition progress inside Flink can be used to send an event-driven message that pulls up the next offline or real-time job.

It also provides surrounding supporting services, including notification and messaging services, metadata services, and model-centric services for the AI domain.

[Picture 4: https://static001.geekbang.org/infoq/90/90c7a8db70c91a14eca4c8ec0c8b367f.jpeg]

13. Defining a Flow in Python

Let's look at how a workflow is ultimately defined in Python on top of AIFlow. The view on the right is the complete workflow definition of an online project. It starts with the definition of a Spark job, whose downstream dependency is described through a configured dependence; the job sends an event-driven message that pulls up the Flink streaming job after it, and the streaming job can in turn pull up the following Spark job in the same message-driven way. Defining these semantics is very simple and takes only four steps: configure each node's config information, define each node's operation (action), declare its dependence, and finally run the topology view of the whole flow. A schematic Python sketch in this style follows.
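As a rough imitation of that four-step style, and explicitly not AIFlow's real API, the self-contained sketch below defines minimal Workflow and Job stand-ins and wires up the Spark to Flink to Spark chain described above; every class, field and job name is invented for illustration.

```python
# Schematic stand-in for a Python-defined workflow: config, action,
# dependence, then run. Not AIFlow's actual API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    name: str
    config: Dict[str, str]                                 # step 1: node config
    action: str                                            # step 2: node operation
    depends_on: List[str] = field(default_factory=list)    # step 3: dependence


@dataclass
class Workflow:
    jobs: List[Job] = field(default_factory=list)

    def add(self, job: Job) -> "Workflow":
        self.jobs.append(job)
        return self

    def run(self) -> None:                                  # step 4: run the flow
        for job in self.jobs:
            print(f"submit {job.name} ({job.action}), waits for {job.depends_on}")


# Offline Spark job -> event-driven Flink streaming job -> downstream Spark job.
flow = (
    Workflow()
    .add(Job("label_join_batch", {"engine": "spark"}, action="spark_sql"))
    .add(Job("feature_stream", {"engine": "flink"}, action="flink_sql",
             depends_on=["label_join_batch"]))
    .add(Job("train_batch", {"engine": "spark"}, action="spark_submit",
             depends_on=["feature_stream"]))
)
flow.run()
```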
[Picture 6: https://static001.geekbang.org/infoq/be/be3732cd75d7ce1a63618db02df6a80a.jpeg]

14. Event-driven stream-batch scheduling

Next, let's look at the complete driving mechanism of stream-batch scheduling. On the right of the figure below is the complete driving view across three worker nodes. The first goes from Source to SQL to Sink. The yellow box is an extended supervisor that collects the progress of the global watermark. When the streaming job finds that its watermark has advanced into the next hour's partition, it sends a message to the NotifyService. After receiving the message, the NotifyService passes it on to the next job, which has a flow operator introduced into its Flink DAG; until that operator receives the message from the upstream job, it blocks the whole job. Once the driving message arrives, meaning the upstream has finished the previous hour's partition, the next flow node can be pulled up and start working. Likewise, the next workflow node introduces a GlobalWatermark Collector to gather its own processing progress; when it finishes the previous hour's partition, it also sends a message to the NotifyService, which uses that message to drive the AIScheduler module and pull up the Spark offline job. As you can see, the whole link supports all four scenarios: batch to batch, batch to stream, stream to stream, and stream to batch.

[Picture 7: https://static001.geekbang.org/infoq/ae/ae8171ca14c355231bd64e57952ef3ab.jpeg]

The sketch below illustrates the "watermark crosses the hour boundary, then notify" hand-off at the heart of this mechanism.
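This is only a rough sketch using an ordinary keyed process function and event-time timers; the actual mechanism described above lives in an extended supervisor operator together with the NotifyService, and the NotifyClient interface here is a hypothetical stand-in. In practice the signal would also need to be aggregated across keys and subtasks rather than fired per key.

```java
// When the watermark passes the end of an hour partition, the registered
// event-time timer fires and a "partition finished" notification is sent.
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class HourPartitionNotifier<K, T> extends KeyedProcessFunction<K, T, T> {

    private static final long HOUR_MS = 3_600_000L;

    /** Hypothetical client that posts "partition finished" events downstream. */
    public interface NotifyClient extends java.io.Serializable {
        void partitionFinished(long hourStartMillis);
    }

    private final NotifyClient notifyClient;

    public HourPartitionNotifier(NotifyClient notifyClient) {
        this.notifyClient = notifyClient;
    }

    @Override
    public void processElement(T value, Context ctx, Collector<T> out) {
        Long ts = ctx.timestamp();
        if (ts != null) {
            // Register a timer at the end of this record's hour partition; it
            // fires only once the watermark has passed that boundary.
            long hourEnd = (ts / HOUR_MS + 1) * HOUR_MS;
            ctx.timerService().registerEventTimeTimer(hourEnd);
        }
        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<T> out) {
        // The watermark has crossed the hour boundary: the previous hour is done.
        notifyClient.partitionFinished(timestamp - HOUR_MS);
    }
}
```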
15. A prototype of the full real-time AI link

On top of this definition and scheduling of stream and batch flows, a prototype of the full real-time AI link took shape in 2020, with experiments as its core orientation. Algorithm engineers can develop nodes in SQL, the DAG workflow can be defined entirely in Python, and monitoring, alerting and operations are integrated.

At the same time, it supports end-to-end connectivity from offline to real time, from data processing to model training, and from model training to experiment effects. On the right is the link of the whole near-line experiment; below it are the services that feed the material data of the experiment link to online prediction and training. Overall, there are three supporting layers:

- basic platform functions, including experiment management, model management, feature management, and so on;
- the underlying services of AIFlow itself;
- platform-level metadata services.

[Picture 8: https://static001.geekbang.org/infoq/15/15df35c75c4824ee339f1eff1f45d789.jpeg]

Four 、 Some prospects for the future

In the coming year, we will focus on two directions.

- The first is the data lake direction. We will focus on incremental computing scenarios from the ODS to the DW layer, as well as breakthroughs in some scenarios from the DW to the ADS layer; the core will be combining Flink with Iceberg and HUDI as the landing of this direction.
- For real-time AI, we will move further toward an experiment-oriented, collaborative real-time AI platform; the core is to build an efficient engineering platform that refines and simplifies the work of algorithm engineers.

[Picture 10: https://static001.geekbang.org/infoq/ac/acdd5705e93dbba0bd7b5518d6a64f0d.jpeg]
Please include a link to the original when reprinting. Thanks.