1. Introduction to Azkaban

Azkaban is a task scheduling framework open-sourced by LinkedIn, similar to workflow frameworks such as JBPM and Activiti in Java EE.

Azkaban's functions and features:

1. Task dependency handling.

2. Task monitoring and failure alerts.

3. Visualization of task flows.

4. Task permission management.

Common task scheduling frameworks include Apache Oozie, LinkedIn Azkaban, Apache Airflow, and Alibaba Zeus. Azkaban is widely used because it is lightweight and pluggable, has a friendly web UI, supports SLA alerts, offers thorough permission control, and is easy to extend. The following figure shows Azkaban's architecture, which has three main parts: Azkaban Webserver, Azkaban Executor, and DB.

The Webserver is mainly responsible for permission checks, project management, and dispatching job flows.

The Executor is mainly responsible for actually running job flows and jobs, and for collecting execution logs.

MySQL stores the execution status of jobs and job flows. The figure shows a single-executor setup, but in practice most projects use multiple executors.

1.1 Job flow execution process

The Azkaban webserver selects a suitable node based on the information it collects from the Executors and pushes the job flow to that node, which then manages and runs all of the flow's jobs.

1.2 Deployment modes

Azkaban supports three deployment modes, ranging from learning and testing to high-availability deployment.

solo-server mode

The DB is an embedded H2 database, and the Web Server and Executor Server run in the same process. This mode includes all Azkaban features, but it is generally used only for learning and testing.

two-server mode

The DB is MySQL, which supports a master-slave architecture. The Web Server and Executor Server run in different processes.

Distributed multiple-executor mode

The DB is MySQL, which supports a master-slave architecture. The Web Server and Executor Server run on different machines, and there are multiple Executor Servers.

1.3 Compiling and deploying

Set up the build environment:

yum install git
yum install gcc-c++
yum install java-1.8.0-openjdk-devel

Download the source code and unpack it (the commands assume 3.42.0.tar.gz has already been downloaded into /data/azkaban), then build:

mkdir -p /data/azkaban/install
cd /data/azkaban
mv 3.42.0.tar.gz azkaban-3.42.0.tar.gz
tar -zxvf azkaban-3.42.0.tar.gz
cd azkaban-3.42.0
./gradlew build installDist -x test

solo-server deployment

The following uses solo-server mode for a simple test deployment.

cd /data/azkaban/install
tar -zxvf ../azkaban-3.42.0/azkaban-solo-server/build/distributions/azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C .

Change the time zone:

cd /data/azkaban/install/azkaban-solo-server-0.1.0-SNAPSHOT
tzselect # choose Asia/Shanghai
vim ./conf/ # change the time zone

Note: Azkaban must be started and stopped from the /data/azkaban/install/azkaban-solo-server-0.1.0-SNAPSHOT/ directory.

Log in

The listening port is defined in the configuration under ./conf/

IP is the server address.

The user name is configured in ./conf/azkaban-users.xml. The user with the admin role is azkaban, and its password is azkaban.

For details on configuration, see:

2. Network connectivity between Azkaban and the data warehouse cluster

Network connectivity between Azkaban and the cloud product Snova currently relies on two facts: 1. the Azkaban Executor servers can reach the public Internet or Snova's server-side IP; 2. Snova can be accessed through a public IP. The figure below shows the network connection diagram:

When an Azkaban Executor runs a job, the job's script or command accesses Snova through the public IP.

The following sections explain, step by step, how to build a workflow based on Azkaban.

3. Preparation

3.1 Create a public IP for the Snova cluster

In the Snova cluster console, on the basic configuration page, click "Apply for an Internet address". Once the operation succeeds, the public IP address for accessing the cluster is displayed.

3.2 Add a Snova access whitelist

In the Snova console, open the cluster details page, go to the configuration page, and create a new whitelist as follows.

Why create this access whitelist?

For security, Snova by default denies database access from addresses or users that are not on the whitelist.

That is, set the whitelist's CIDR address to xx.xx.xx.xx/xx so that it covers the IPs or network segments of all Azkaban Executors.
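Before saving the whitelist, it can help to confirm that each Executor address really falls inside the CIDR block being entered. The following is a minimal sketch of such a check. The function names ip_to_int and ip_in_cidr and the sample addresses are assumptions made for illustration and are not part of Snova or Azkaban. It handles only plain dotted IPv4 addresses without leading zeros.

```shell
# Convert a dotted IPv4 address (e.g. 10.0.0.5) to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IPv4 address $1 lies inside CIDR block $2 (e.g. 10.0.0.0/24).
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part before the slash
  bits=${2#*/}                 # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example with hypothetical addresses:
if ip_in_cidr 10.0.0.5 10.0.0.0/24; then
  echo "executor address is covered by the whitelist"
fi
```

Running the check once per Executor address shows quickly whether the prefix length chosen for the whitelist is too narrow.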

3.3 User authorization

Section 3.2 recommends creating a dedicated user for SCF task scheduling and computation. That user must then be granted access to the relevant databases and tables.

Create the user and set its access password:
CREATE USER scf_visit WITH LOGIN PASSWORD 'scf_passwd';

Grant permissions on the table:
GRANT ALL ON t1 TO scf_visit;

4. Scheduling tasks

Log in to Azkaban, then go through the steps in order: Create Project => Upload the prepared zip package => Execute Flow.

See the reference documentation:

4.1 Create a project

4.2 Create jobs

File name: job.job. The name must end with .job. The contents are as follows:

command=echo "job1"

Note: for the available values of type and their usage, see

command=echo "job2 xx"
command.1=ls -al

Note: dependencies lists the file names (without the .job suffix) of the jobs that this job depends on. To depend on more than one job, separate the names with commas, for example job2,job5.

command=sleep 60
command=sh /data/shell/ psqlx

In the script under /data/shell/, user logic can be wrapped in a function. The script below reads the data in table t1 and prints it:

# Query table t1 in Snova and print the result.
psqlx() {
  result=$(PGPASSWORD=scf_passwd psql -h xx.xx.xx.xx -p xx -U scf_visit -d postgres <<EOF
select * from t1;
EOF
)
  echo "$result"
}
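The type and dependencies keys mentioned in the notes of this section can be sketched as two complete job files. The snippet below is illustrative only: the working directory flow_demo is a hypothetical name, and the job contents are placeholders rather than the author's actual jobs.

```shell
# Hypothetical working directory for the demo flow.
FLOW_DIR="${FLOW_DIR:-./flow_demo}"
mkdir -p "$FLOW_DIR"

# job1 has no dependencies, so it runs first.
cat > "$FLOW_DIR/job1.job" <<'EOF'
type=command
command=echo "job1"
EOF

# job2 names job1 as a dependency (file name without the .job suffix),
# so it starts only after job1 has finished successfully.
cat > "$FLOW_DIR/job2.job" <<'EOF'
type=command
dependencies=job1
command=echo "job2"
EOF
```

Here the flow takes its name from the job that nothing else depends on, which in this sketch is job2.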
4.3 Upload the job archive

Compress all job files into a single zip package. Note: all files must sit in the root directory of the archive, with no subdirectories, as follows:
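The flat-layout rule above can be enforced before packaging. The following is a minimal sketch, assuming the zip command-line tool is installed. The function names is_flat_job_dir and package_flow are hypothetical helpers introduced for illustration.

```shell
# Succeed only if the directory holds at least one .job file
# and contains no subdirectories.
is_flat_job_dir() {
  dir="$1"
  # Any subdirectory violates the flat-layout rule.
  if [ -n "$(find "$dir" -mindepth 1 -type d)" ]; then
    return 1
  fi
  # Require at least one .job file at the top level.
  [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.job')" ]
}

# Package the flow. The -j option of zip discards directory paths,
# so every file ends up at the root of the archive.
package_flow() {
  dir="$1"
  out="$2"
  is_flat_job_dir "$dir" || {
    echo "not a flat job directory: $dir" >&2
    return 1
  }
  zip -j "$out" "$dir"/*
}
```

A call such as package_flow ./flow_demo flow_demo.zip (both names are placeholders) would then produce an archive ready for upload.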

4.4 Run the flow

View the execution process and results.

4.5 Set a periodic schedule

After debugging succeeds, you can set up a periodic schedule, for example running the workflow at a fixed time every day, to complete the scheduling plan.

5. Practice

This section compares in detail the two most popular schedulers on the market. The better known of the two is probably Apache Oozie.

5.1 Comparison

Functionality

Both can schedule Linux commands, MapReduce, Spark, Pig, Java programs, Hive, and script workflow tasks.

Both can run workflow tasks on a schedule.

Workflow definition

1. Azkaban defines workflows in properties files.

2. Oozie defines workflows in XML files.

Workflow parameter passing

1. Azkaban supports direct parameter passing, for example ${input}.

2. Oozie supports parameters and EL expressions, for example ${fs:dirSize(myInputDir)}.

Scheduled execution

1. Azkaban triggers scheduled tasks based on time.

2. Oozie triggers scheduled tasks based on time and on input data.

Resource management

1. Azkaban has stricter permission control, for example over which users may read, write, or execute a workflow.

2. Oozie has no strict permission control.

5.2 Application scenarios

Data analysis can be summarized in three steps: 1. data import; 2. data computation; 3. data export.

Tasks of these three types may run concurrently and may depend on one another, so Azkaban can largely meet the scheduling, management, and execution needs of such scenarios.

First, create job1 to import user data, for example from COS. The task runs the following SQL command.

insert into gp_table select * from cos_table;

Data can also be imported with other tools, such as DataX, which can periodically import data from other databases into the Snova data warehouse. In that case, simply deploy DataX to the appropriate directory on the Azkaban Executor machines and invoke it from a job.

Next, create job2 to compute and analyze the user data. This step can consist of multiple jobs run multiple times, and the jobs can also run concurrently.

Finally, the computed results can be exported to the application database.

insert into cos_table select * from gp_table;

5.3 Shortcomings

1. Retrying a failure at job granularity is currently fairly complicated in Azkaban. Under Projects -> Executions, find the failed execution ID, select the execution instance, open its details, and click rerun. This creates a new workflow instance ID rather than rerunning the original failed instance. The new instance starts from the failed job, and jobs that already succeeded are skipped and not run again.

2. When a job starts a complex program through a shell command, the shell returning success does not mean the program ran successfully.

3. Job execution management lacks fault tolerance. If Azkaban is restarted or the executor process dies after a job has submitted a running task, the task is marked as failed even though the actual task may have completed successfully.
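One way to reduce the impact of point 2 is to have the job's script check more than the exit status of the launcher. The following is a minimal sketch, assuming the program prints a known success marker when it truly finishes. The function name run_and_verify and the marker string JOB_OK are hypothetical.

```shell
# Run a command and require both a zero exit status and a success
# marker in its output. Usage: run_and_verify MARKER command [args...]
run_and_verify() {
  marker="$1"
  shift
  output=$("$@" 2>&1)     # capture stdout and stderr of the command
  status=$?
  printf '%s\n' "$output" # keep the output visible in the job log
  # Propagate a failing exit status unchanged.
  [ "$status" -eq 0 ] || return "$status"
  # Exit status was 0, so also require the application-level marker.
  printf '%s\n' "$output" | grep -q -- "$marker"
}

# Example: succeeds because the marker appears in the output.
run_and_verify "JOB_OK" echo "JOB_OK finished"
```

If the wrapper script that a job's command calls exits with the status of this function, Azkaban sees a non-zero exit code whenever the marker is missing, and the job is marked as failed instead of silently succeeding.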

