Practice of Du Xiaoman's Financial Big Data Architecture

DataFunTalk 2021-10-14 06:47:24


Sharing guest: Zhao Hui, Architect, Du Xiaoman Financial

Editing and organization: Jiang Wenjuan, Jiageng College of Xiamen University

Production platform: DataFunTalk

Introduction: Big data architecture faces many challenges in financial scenarios. Architecturally, the full link of business data processing, storage, and use demands finer-grained management and control. In terms of usability, users do not want to understand the implementation and control details of the big data architecture; they simply want lower-threshold, faster ways to use the products. In terms of management, the company hopes that experience accumulated in data processing is effectively passed on.


This article shares the corresponding solutions to these three problems: ① MMR, a big data architecture built on Baidu Cloud products, addressing control requirements; ② Honghu, Du Xiaoman's data lake management and analysis platform, lowering the threshold; ③ Yichuang, Du Xiaoman's model training, monitoring, and evaluation system, enabling experience inheritance.

Big Data Cloud Architecture: MMR

Du Xiaoman's big data cloud architecture is built on Baidu Cloud's big data products, whose standard solutions are similar to open source big data stacks. Users first submit tasks, which enter the computing layer to serve computing needs, and then the storage layer, which serves data storage needs. To meet finer-grained management and control requirements, we extended this architecture.


We divide the architecture into the following parts: access layer, table control layer, computing layer, virtual storage layer, and physical storage layer.

1. User layer

The user layer maps user operations to individual people. Specifically, we transformed the big data entry points and connected them to Du Xiaoman's employee management system: users of big data services are tagged with their identity when they log in to an operating machine, so when they submit actions and commands, the architecture can identify the individual. Every command or operation is thus executed under a personal identity, and all tasks and user actions can be traced to a specific responsible person.

2. Table control management

Table control management meets the business need to share structured data partially. Big data stores data in Hive tables, and a Hive table may have hundreds or even thousands of fields with different security level requirements. For example, out of 100 fields, only 20 may be meant for sharing and the remaining 80 not; in that case, field-level permission control over the table is required. We therefore established a dedicated permission control center in the outer layer: users can label a table at the field level and set sharing and application permissions on the platform. When a user submits a task to Hive Server or Spark Server, the service layer verifies whether the user holds permission for every field the task or operation touches, and enforces field-level permission control accordingly.
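The field-level check described above can be sketched as a simple set comparison: before a task runs, the fields it reads are compared against the fields the user has been granted. All names here are illustrative, not Du Xiaoman's actual code.

```python
# Hypothetical sketch of field-level permission control: a task is
# allowed only if every field it touches has been granted to the user.

def check_field_permissions(user_grants: dict, table: str, requested_fields: set) -> bool:
    """Return True only if every requested field of `table` is granted."""
    granted = user_grants.get(table, set())
    return requested_fields <= granted

# The user may see only 3 of the table's fields.
grants = {"loan_events": {"event_id", "event_time", "channel"}}

assert check_field_permissions(grants, "loan_events", {"event_id", "channel"})
assert not check_field_permissions(grants, "loan_events", {"event_id", "user_phone"})
```

In practice the service layer would resolve the requested fields by parsing the submitted SQL, but the gating decision reduces to this subset test.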

3. Computing layer

The computing layer mainly controls resources, relying on the capabilities of Baidu Cloud's infrastructure. Between the computing layer and the storage layer, Du Xiaoman set up a virtual management layer, which mainly addresses sharing and isolation requirements for unstructured data. Generally speaking, each business line's data is private, but the data processing teams of each business, both upstream and downstream, face the need to share and use part of the data. We therefore control permissions at the directory level, and on that basis additionally constrain the IPs and IP segments a user may access from, achieving finer control. This realizes a degree of data sharing while guaranteeing business isolation, so that all data operations and usage are controllable and all processes are auditable.
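The directory-plus-IP control described above can be sketched as follows: access is allowed only when the requested path falls under a granted directory and the caller's IP lies inside an allowed network segment. The policy shape and all names are assumptions for illustration.

```python
# Hedged sketch of directory-level permission control combined with
# IP/IP-segment restrictions (requires Python 3.9+ for is_relative_to).
import ipaddress
from pathlib import PurePosixPath

def is_allowed(policy: dict, user: str, path: str, client_ip: str) -> bool:
    """Allow access only under a granted directory AND from an allowed segment."""
    for rule in policy.get(user, []):
        under_dir = PurePosixPath(path).is_relative_to(rule["dir"])
        ip_ok = any(ipaddress.ip_address(client_ip) in ipaddress.ip_network(seg)
                    for seg in rule["ip_segments"])
        if under_dir and ip_ok:
            return True
    return False

policy = {"analyst_a": [{"dir": "/warehouse/risk/shared",
                         "ip_segments": ["10.12.0.0/16"]}]}

assert is_allowed(policy, "analyst_a", "/warehouse/risk/shared/2021/part-0", "10.12.3.7")
assert not is_allowed(policy, "analyst_a", "/warehouse/risk/private/x", "10.12.3.7")
```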


On this basis, we face a bigger problem: the business migrated from Baidu's internal architecture to the current open source architecture on Baidu Cloud, which is similar to moving a big data architecture from closed source to open source. Although the computing logic and methods are roughly the same, many details such as entry point design, usage habits, and functional experience are inconsistent. To resolve these differences:

  • First, unify users' usage habits. We assemble all the tools users employ to access big data services into a unified client, which smooths out the differences automatically. Migrating from the Baidu architecture to the Baidu Cloud architecture then mainly involves modifying configuration and verifying results, with no code-level changes.

  • Second, build a virtual storage layer to achieve storage-layer compatibility: object storage is accessed in the manner and with the habits of a file system, keeping functions and experience consistent at the user level.
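The virtual storage layer idea can be sketched as a thin file-system facade over an object-storage SDK. `ObjectStoreClient` below is a hypothetical in-memory stand-in for a real SDK; only the wrapping pattern is the point.

```python
# Minimal sketch of a "virtual storage layer": expose object storage
# through file-system-style paths so users keep their existing habits.

class ObjectStoreClient:
    """Hypothetical stand-in for a cloud object-storage SDK."""
    def __init__(self):
        self._blobs = {}
    def put_object(self, key: str, data: bytes):
        self._blobs[key] = data
    def get_object(self, key: str) -> bytes:
        return self._blobs[key]

class VirtualFS:
    """File-system facade: '/dir/file' maps to an object key."""
    def __init__(self, client):
        self.client = client
    def write(self, path: str, data: bytes):
        self.client.put_object(path.lstrip("/"), data)
    def read(self, path: str) -> bytes:
        return self.client.get_object(path.lstrip("/"))

fs = VirtualFS(ObjectStoreClient())
fs.write("/logs/2021-10-14/app.log", b"ok")
assert fs.read("/logs/2021-10-14/app.log") == b"ok"
```

A real implementation would also map directory listings, renames, and permissions onto object-store operations; the facade keeps those differences invisible to users.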

Intelligent scheduling & high availability: inside Baidu, the resource pool and the available range of resources were larger, whereas on the cloud users can only use what they apply for, so congestion at traffic peaks is more serious. We therefore first tune task parameters to avoid wasting resources, and at the same time apply window scheduling to all tasks, dynamically adjusting their execution times to relieve peak congestion. On top of scheduling, security, and compatibility, we also added high-availability support, mainly by having an agent on every user's machine keep a heartbeat. This way we know the user's machine environment, the job environment used to submit tasks, and the type of cluster in use; when a user reports a task failure, the problem can be located very quickly.
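The agent heartbeat described above can be sketched as a small periodic report of the machine and job environment. The payload fields are illustrative assumptions; the point is that each heartbeat carries enough context to localize a failed task quickly.

```python
# Hedged sketch of a client-side heartbeat payload: machine, job
# environment, and cluster type, so failures can be traced fast.
import json
import time

def build_heartbeat(host: str, job_env: str, cluster: str) -> str:
    """Serialize one heartbeat report (fields are illustrative)."""
    return json.dumps({
        "host": host,         # user's operating machine
        "job_env": job_env,   # environment used to submit tasks
        "cluster": cluster,   # type of cluster in use
        "ts": int(time.time()),
    })

hb = json.loads(build_heartbeat("client-07", "spark-2.4", "emr-batch"))
assert hb["host"] == "client-07" and hb["cluster"] == "emr-batch"
```

In a real agent this payload would be sent to a central service on a fixed interval, with a missed heartbeat itself serving as a failure signal.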

Data Lake Management and Analysis Platform: Honghu

1. Agile, intelligent global data management


The target users of the data lake management and analysis platform are policy analysts, whose requirement for the platform architecture is clear: simple and easy to use.

The first step is to reduce users' cognitive threshold, so that anyone can find and understand data very quickly and apply it efficiently. The specific measures are as follows:

  • Unified metadata management: collect metadata from different storage systems and present it uniformly on the platform, eliminating metadata islands and unifying metadata at the user perception layer.

  • Subject domain construction and intelligent recommendation: build subject domains, labeling and classifying data to reduce the cost of understanding it. To counter data table sprawl, we built an intelligent recommendation system based on data heat and data quality, reducing the difficulty of data retrieval.

  • Data quality control: strictly control the quality of data entering circulation and output data quality reports, addressing user concerns about data availability.

  • Data value analysis: through data lineage and production task chains, precisely grasp the value of data, improving users' understanding of it.

  • Permission control: desensitization and encryption management ensure data security during data flow; release and application process control ensures data is shared and used according to business needs.
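The heat-and-quality recommendation mentioned above can be sketched as a simple weighted ranking: tables with high recent access frequency and high quality scores surface first. The weighting below is an assumption for demonstration, not the platform's actual formula.

```python
# Illustrative ranking of tables by data heat (access frequency) and
# data quality, to ease data retrieval in a sprawling catalog.

def recommend(tables: list, top_k: int = 2, w_heat: float = 0.6) -> list:
    """Score = w_heat * normalized heat + (1 - w_heat) * quality (0..1)."""
    max_heat = max(t["heat"] for t in tables) or 1
    scored = sorted(
        tables,
        key=lambda t: w_heat * t["heat"] / max_heat + (1 - w_heat) * t["quality"],
        reverse=True,
    )
    return [t["name"] for t in scored[:top_k]]

catalog = [
    {"name": "dwd_loan_detail", "heat": 900, "quality": 0.95},
    {"name": "tmp_backup_old",  "heat": 10,  "quality": 0.40},
    {"name": "dim_user",        "heat": 500, "quality": 0.99},
]
assert recommend(catalog) == ["dwd_loan_detail", "dim_user"]
```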

2. Multi-engine visual drag-and-drop batch & stream development platform


The second step is to reduce the usage threshold by building a multi-engine, visual, drag-and-drop batch & stream development platform. The specific measures are as follows:

  • Data integration, also known as data exchange: we built a data exchange platform on which users only need to configure simple tasks to acquire data, with no coding required.

  • Visual integrated IDE: supports syntax checking, highlighting, and formatting; supports one-click deployment and execution of tasks; supports Hive, Spark, Flink, GP (Greenplum), Shell, and other development models.

  • Drag-and-drop scheduling: intuitively and effectively shows the dependencies between tasks; multi-dimensional monitoring and analysis ensure tasks execute correctly.

  • Data analysis: routine analysis needs are served by OLAP engines such as Greenplum; ad hoc analysis is served by Presto, supporting joint analysis of data in Hive tables and Greenplum.

  • Data API: supports data analysis and result confirmation, taking effect in online systems with one click.
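The dependency-driven scheduling above amounts to executing tasks in topological order over a DAG. A minimal sketch with Python's standard `graphlib` (task names are illustrative):

```python
# Task dependencies as a DAG: each task lists its upstream tasks, and
# the scheduler runs them in a topological order.
from graphlib import TopologicalSorter

deps = {
    "load_hive_table": [],
    "spark_feature_job": ["load_hive_table"],
    "gp_report": ["spark_feature_job"],
    "data_api_publish": ["gp_report"],
}

order = list(TopologicalSorter(deps).static_order())
assert order.index("load_hive_table") < order.index("spark_feature_job")
assert order[-1] == "data_api_publish"  # the sink task runs last
```

A production scheduler additionally retries failures, enforces time windows, and monitors each node, but the execution order itself is this topological sort.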

3. A one-stop data analysis platform for the whole data life cycle


A one-stop data analysis platform covering the whole data life cycle greatly lowers the usage threshold for analysts.

Model Training, Monitoring, and Evaluation System: Yichuang

For modelers, we built a model training and evaluation platform: Yichuang.

1. One-click, whole-process model training


The one-click, whole-process model training and effectiveness system unifies the model training and validation process through standardization and templating, improving data and model quality and ensuring online/offline consistency.

  • Unified management of code and environment: model and feature processing follow unified standards and specifications.

  • Standardized sample features: building a general sample library and feature library solves the problem of inconsistent data calibers during training.

  • Standardized model training: dedicated clusters and a standardized training process guarantee training efficiency, while standardized processing operators and an update-and-share mechanism improve overall model effect.

  • Standardized evaluation and deployment: building online and offline feature libraries with a verification mechanism solves the model's online/offline consistency problem.

2. Plug-in, component-based model evaluation framework


The plug-in, component-based model evaluation framework broadens the evaluation dimensions of models while unifying the caliber of model evaluation, smoothing out differences in model effect and cognition.
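One common way to build such a plug-in framework is a metric registry: each evaluator registers itself, and every model is scored with the same unified set, so evaluation calibers cannot drift between teams. The metrics below are illustrative choices, not the platform's actual ones.

```python
# Hedged sketch of a plug-in evaluation framework via a metric registry.
EVALUATORS = {}

def evaluator(name):
    """Decorator that registers a metric plug-in under a unified name."""
    def register(fn):
        EVALUATORS[name] = fn
        return fn
    return register

@evaluator("accuracy")
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

@evaluator("positive_rate")
def positive_rate(y_true, y_pred):
    return sum(y_pred) / len(y_pred)

def evaluate(y_true, y_pred):
    """Score a model with every registered plug-in: one unified caliber."""
    return {name: fn(y_true, y_pred) for name, fn in EVALUATORS.items()}

report = evaluate([1, 0, 1, 1], [1, 0, 0, 1])
assert report["accuracy"] == 0.75 and report["positive_rate"] == 0.5
```

Adding a new evaluation dimension is then just registering another function, without touching existing components.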

Outlook & Q&A
1. Outlook
Our outlook for the next stage:
  • Embrace cloud native and break down resource barriers

  • Build a lake-warehouse integrated analysis framework to speed up business iteration

  • Optimize the one-stop big data platform and release the value of all data

2. Q&A

Q: How are online and offline features kept consistent?

A: We distinguish between an online feature library and an offline feature library. The offline feature library controls feature quality, while the online feature library keeps the latest image of the offline one. Model training is based on the offline feature library; online scoring is based on the online feature library. When features move from the offline library to the online library, their effect is verified along multiple dimensions.
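The publish step in that answer can be sketched as follows: the offline library is the source of truth, and a snapshot is mirrored online only after quality checks pass, so training and scoring see the same values. The check and all names are assumptions for demonstration.

```python
# Illustrative sketch: mirror a verified offline feature snapshot to
# the online feature library before it is used for scoring.

def publish_snapshot(offline: dict, online: dict, null_rate_limit: float = 0.1):
    """Copy offline features online only if the quality gate passes."""
    values = list(offline.values())
    null_rate = sum(v is None for v in values) / len(values)
    if null_rate > null_rate_limit:
        raise ValueError("snapshot rejected: too many missing features")
    online.clear()
    online.update(offline)  # online library = latest offline image

offline_store = {"user_42:income_bucket": 3, "user_42:overdue_cnt": 0}
online_store = {}
publish_snapshot(offline_store, online_store)
assert online_store == offline_store  # training and scoring now agree
```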

That's all for today's sharing. Thank you.



