MapReduce is a programming framework for distributed computing programs, and the core framework for users developing "Hadoop-based data analysis applications".
The core function of MapReduce is to integrate the business logic code written by the user with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
Advantages
1）MapReduce is easy to program
By simply implementing a few interfaces, you can complete a distributed program, and that program can be distributed to run on large numbers of cheap PC machines.
In other words, writing a distributed program is just like writing a plain serial program. It is this characteristic that has made MapReduce programming so popular.
2）Good scalability
When your computing resources are no longer sufficient, you can expand computing power simply by adding machines.
3） High fault tolerance
MapReduce was designed from the start to let programs be deployed on cheap PC machines, which requires it to have high fault tolerance.
For example, if one machine goes down, its computing tasks can be transferred to another node to run so that the task does not fail; this process needs no human intervention and is handled entirely inside Hadoop.
4）Suited to offline processing of massive data at the PB level and above
It can have thousands of servers in a cluster working concurrently, providing data processing capability.
Disadvantages
1）Not good at real-time computing
MapReduce cannot return results within milliseconds or seconds the way MySQL can.
2）Not good at streaming computing
The input data of streaming computation arrives dynamically, while a MapReduce input data set is static and cannot change dynamically.
This is determined by MapReduce's own design characteristics: the data source must be static.
3）Not good at DAG (directed acyclic graph) computation
Here, multiple applications have dependencies: the input of each application is the output of the previous one.
In this situation MapReduce is not incapable, but each MapReduce job writes its output to disk, which produces a large amount of disk IO and leads to very poor performance.
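The cost can be seen in a small sketch (plain Python standing in for two chained MapReduce jobs; the file name and the jobs themselves are made up for illustration): each job must materialize its complete output on disk before the next job can even start reading.

```python
import json
import os
import tempfile

def job1(records):
    # First "job": square every number.
    return [x * x for x in records]

def job2(records):
    # Second "job": keep only even values; it can start only after
    # job1's complete output has been written to disk and re-read.
    return [x for x in records if x % 2 == 0]

data = [1, 2, 3, 4]
path = os.path.join(tempfile.mkdtemp(), "job1_output.json")

with open(path, "w") as f:
    json.dump(job1(data), f)      # job 1 writes all of its output to disk

with open(path) as f:             # job 2 re-reads all of it from disk
    result = job2(json.load(f))

print(result)                     # [4, 16]
```

With a long DAG of jobs, this write-then-read round trip repeats at every edge, which is exactly the disk IO overhead described above.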
（1）A distributed computing program often needs to be divided into at least 2 stages.
（2）The concurrent MapTask instances of the first stage run fully in parallel and are independent of each other.
（3）The concurrent ReduceTask instances of the second stage are also independent of each other, but their data depends on the output of all the concurrent MapTask instances of the previous stage.
（4）The MapReduce programming model can contain only one Map stage and one Reduce stage; if the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.
Summary: analyze the WordCount data flow to deeply understand the MapReduce core idea.
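As a rough illustration (plain Python, not actual Hadoop code), the two-stage WordCount data flow can be simulated like this: independent map tasks each emit (word, 1) pairs, a shuffle step groups the pairs by key, and the reduce step sums each key's group.

```python
from collections import defaultdict

def map_task(line):
    # Map stage: each MapTask processes its own input split independently.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all (k, v) pairs by key across every map output.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_task(word, counts):
    # Reduce stage: one call per key group, depending on ALL map outputs.
    return (word, sum(counts))

splits = ["hello hadoop", "hello mapreduce"]
map_output = [pair for split in splits for pair in map_task(split)]
result = dict(reduce_task(k, v) for k, v in shuffle(map_output).items())
print(result)  # {'hello': 2, 'hadoop': 1, 'mapreduce': 1}
```

Note how the reduce step cannot begin until every map task's output is available, which is the stage dependency described in point (3) above.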
A complete MapReduce program has three types of instance processes when running in distributed mode:
（1）MrAppMaster: responsible for process scheduling and state coordination of the whole program.
（2）MapTask: responsible for the entire data processing flow of the Map stage.
（3）ReduceTask: responsible for the entire data processing flow of the Reduce stage.
Decompiling the WordCount source with a decompiler tool shows that the case consists of a Map class, a Reduce class, and a driver class, and that the data types are Hadoop's own encapsulated serialization types.
The program written by the user is divided into three parts: Mapper, Reducer, and Driver.
1. Mapper Stage
（1）A user-defined Mapper must extend its parent class
（2）Mapper's input data is in KV-pair form (the KV types are customizable)
（3）Mapper's business logic is written in the map() method
（4）Mapper's output data is in KV-pair form (the KV types are customizable)
（5）The map() method (in the MapTask process) is called once for each <K,V> pair
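The five rules above can be mirrored in a plain-Python analogue of Hadoop's Mapper (the class and the context list here are illustrative stand-ins, not the real Hadoop API): input and output are both KV pairs, the logic lives in map(), and map() is invoked once per input pair.

```python
class WordCountMapper:
    """Stand-in for a user-defined Mapper subclass: logic goes in map()."""

    def map(self, key, value, context):
        # Input KV pair -- key: byte offset of the line, value: the line text.
        for word in value.split():
            context.append((word, 1))  # emit output KV pairs

# The framework would call map() once per input <K,V>; we do it by hand here.
context = []
mapper = WordCountMapper()
for offset, line in [(0, "hello hadoop"), (13, "hello mapreduce")]:
    mapper.map(offset, line, context)

print(context)  # [('hello', 1), ('hadoop', 1), ('hello', 1), ('mapreduce', 1)]
```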
2. Reducer Stage
（1）A user-defined Reducer must extend its parent class
（2）Reducer's input data type corresponds to Mapper's output data type, which is also KV pairs
（3）Reducer's business logic is written in the reduce() method
（4）The ReduceTask process calls the reduce() method once for each group of <k,v> pairs sharing the same k
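The key point in rule (4) is that reduce() runs once per key group, not once per pair. A plain-Python sketch (again an analogue of the Hadoop API, not the real thing):

```python
from itertools import groupby
from operator import itemgetter

class WordCountReducer:
    """Stand-in for a user-defined Reducer subclass: logic goes in reduce()."""

    def reduce(self, key, values, context):
        context.append((key, sum(values)))  # one output pair per key group

# Shuffled map output, sorted by key (the framework guarantees this ordering).
pairs = sorted([("hello", 1), ("hadoop", 1), ("hello", 1)])
context = []
reducer = WordCountReducer()
for key, group in groupby(pairs, key=itemgetter(0)):
    # reduce() is invoked exactly once per <k, list-of-v> group.
    reducer.reduce(key, [v for _, v in group], context)

print(context)  # [('hadoop', 1), ('hello', 2)]
```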
3. Driver Stage
It is equivalent to a client of the YARN cluster, used to submit our whole program to the YARN cluster; what is submitted is a job object that encapsulates the relevant running parameters of the MapReduce program.
For details, see: 【MapReduce】WordCount case practice
Keep going!