This article helps you understand MapReduce in big data technology

ZSYL 2021-09-15 08:53:56

1. MapReduce Definition

MapReduce is a programming framework for distributed computing programs, and it is the core framework with which users develop "Hadoop-based data analysis applications."

The core function of MapReduce is to integrate the business logic code written by the user with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

2. MapReduce Advantages and Disadvantages

2.1 Advantages

1) MapReduce is easy to program

By simply implementing a few interfaces, you can complete a distributed program, and that program can be deployed to run on a large number of cheap PC machines.

In other words, writing a distributed program becomes exactly like writing a simple serial program. It is this characteristic that has made MapReduce programming so popular.

2) Good scalability

When your computing resources are no longer sufficient, you can expand computing power simply by adding machines.

3) High fault tolerance

MapReduce was designed from the start to be deployable on cheap PC machines, which requires it to have high fault tolerance.

For example, if one machine goes down, the computing tasks on it can be transferred to another node so that the task does not fail; this process requires no human intervention and is completed entirely inside Hadoop.

4) Well suited to offline processing of massive data at the PB level and above

It can make thousands of server nodes in a cluster work concurrently, providing large-scale data processing capability.

2.2 Disadvantages

1) Not good at real-time computing

MapReduce cannot, like MySQL, return results within milliseconds or a few seconds.

2) Not good at streaming computing

The input data of streaming computing arrives dynamically, whereas a MapReduce input data set is static and cannot change while the job runs.

This is determined by MapReduce's own design characteristics: the data source must be static.

3) Not good at DAG (directed acyclic graph) computing

In a DAG workload, multiple applications depend on one another: the input of each application is the output of the previous one.

In this situation it is not that MapReduce cannot do the work; rather, when it is used this way, every MapReduce job writes its output to disk, which generates a large amount of disk I/O and leads to very poor performance.

3. MapReduce Core Idea

(1) Distributed computing programs often need to be divided into at least two stages.

(2) The concurrent MapTask instances of the first stage run fully in parallel and are independent of one another.

(3) The concurrent ReduceTask instances of the second stage are also independent of one another, but their input data depends on the output of every concurrent MapTask instance of the previous stage.

(4) The MapReduce programming model can contain only one Map stage and one Reduce stage. If the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.

Summary: analyzing the data flow of WordCount is a good way to deepen your understanding of the MapReduce core idea.
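The two-stage data flow above can be sketched in plain Java (no Hadoop dependency; all class and method names here are illustrative, not Hadoop's API): each simulated MapTask independently emits <word, 1> pairs from its own input split, the pairs are then grouped by key, and each group is summed as a ReduceTask would do.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Stage 1: each MapTask processes one input split independently,
    // emitting a <word, 1> pair for every token it sees.
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : split.split("\\s+")) {
            if (!token.isEmpty()) out.add(new SimpleEntry<>(token, 1));
        }
        return out;
    }

    // Shuffle + Stage 2: group the outputs of ALL MapTasks by key,
    // then sum each group, as a ReduceTask would.
    static Map<String, Integer> run(List<String> splits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String split : splits) { // the MapTasks could run in parallel
            for (Map.Entry<String, Integer> kv : mapTask(split)) {
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = run(List.of("hello world", "hello hadoop"));
        System.out.println(result); // {hadoop=1, hello=2, world=1}
    }
}
```

Note that the second phase cannot start producing final counts until it has seen the output of every map task, which mirrors point (3) above.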

4. MapReduce Processes

A complete MapReduce program has three types of instance processes at distributed runtime:

(1) MrAppMaster: responsible for process scheduling and state coordination of the whole program.

(2) MapTask: responsible for the entire data processing flow of the Map stage.

(3) ReduceTask: responsible for the entire data processing flow of the Reduce stage.

5. Official WordCount Source Code

Decompiling the official example with a decompiler shows that the WordCount case consists of a Map class, a Reduce class, and a driver class, and that the data types used are Hadoop's own encapsulated serialization types.

6. Common data serialization types

The commonly used Hadoop Writable types correspond to Java types as follows:

Java type → Hadoop Writable type
boolean → BooleanWritable
byte → ByteWritable
int → IntWritable
long → LongWritable
float → FloatWritable
double → DoubleWritable
String → Text
Map → MapWritable
array → ArrayWritable
null → NullWritable

7. MapReduce Programming Specification

The program written by the user is divided into three parts: Mapper, Reducer, and Driver.

1. Mapper stage

(1) The user-defined Mapper must extend its framework-provided parent class.

(2) Mapper's input data is in <K,V> pair form (the K and V types are customizable).

(3) Mapper's business logic is written in the map() method.

(4) Mapper's output data is in <K,V> pair form (the K and V types are customizable).

(5) The map() method (run in the MapTask process) is called once for each input <K,V> pair.
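The contract in (1)-(5) can be sketched in plain Java (no Hadoop dependency; the class and method names are illustrative stand-ins, not Hadoop's actual API): the framework calls map() once per input <K,V> record, and the method may emit any number of output pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Illustrative stand-in for a WordCount Mapper: the input key is the
// line's byte offset, the input value is the line text, and the output
// pairs are <word, 1>.
public class WordCountMapperSketch {
    // Called once per input <K,V> record, mirroring rule (5).
    public void map(long offset, String line, BiConsumer<String, Integer> emit) {
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) emit.accept(token, 1); // emit <word, 1>
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        new WordCountMapperSketch().map(0L, "hello hello world",
            (word, one) -> emitted.add(word + "=" + one));
        System.out.println(emitted); // [hello=1, hello=1, world=1]
    }
}
```

Notice that the Mapper does no summing at all: duplicate words simply produce duplicate <word, 1> pairs, and aggregation is left to the Reducer.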

2. Reducer Stage

(1) The user-defined Reducer must extend its framework-provided parent class.

(2) Reducer's input data type corresponds to Mapper's output data type, which is also <K,V>.

(3) Reducer's business logic is written in the reduce() method.

(4) The ReduceTask process calls the reduce() method once for each group of <k,v> pairs that share the same k.
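Rule (4) above, one reduce() call per group of identical keys, can also be sketched in plain Java (illustrative names, no Hadoop dependency):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountReducerSketch {
    // Called once per key group, mirroring rule (4): all values that
    // share the same key arrive together in a single reduce() call.
    public int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // After the shuffle, identical keys have been grouped, so the
        // Reducer sees each key exactly once with all of its values.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hello", List.of(1, 1));
        grouped.put("world", List.of(1));

        WordCountReducerSketch reducer = new WordCountReducerSketch();
        grouped.forEach((key, values) ->
            System.out.println(key + "=" + reducer.reduce(key, values)));
        // prints: hello=2 then world=1
    }
}
```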

3. Driver Stage

The Driver is equivalent to a client of the YARN cluster. It is used to submit our whole program to the YARN cluster, and what is submitted is a job object that encapsulates the operating parameters of the MapReduce program.
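For reference, a typical WordCount driver looks roughly like the sketch below (it assumes the classic org.apache.hadoop.mapreduce API; the WordCountMapper and WordCountReducer classes are hypothetical, and a Hadoop installation is required to compile and run it). It is essentially pure job configuration plus submission:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // 1. Get the job object that encapsulates the run parameters.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Tell the framework where our classes live.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // hypothetical Mapper
        job.setReducerClass(WordCountReducer.class); // hypothetical Reducer

        // 3. Declare the Map-output and final-output KV types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 4. Set input and output paths, then submit to the cluster.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```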

8. WordCount Case practice

Please refer to: 【MapReduce】WordCount Case Practice

Come on!

Thank you!

Keep striving!

Please include a link to the original when reprinting. Thank you!