Scheduling algorithms: when many MapReduce jobs are submitted at once, in what order are they executed?

A scheduling algorithm has to balance two goals:
1. Improve job throughput.
2. Respect job priorities.
Three schedulers are available. If jobs are not finishing in time while the utilization of machine resources is relatively low, it is worth considering a different scheduler:
1. FifoScheduler: the default scheduler. Applications are processed first in, first out. There is a single queue for submitted applications, and no application priority can be configured.
2. CapacityScheduler: the capacity scheduler. It uses multiple queues, and priority depends on the job: a job that needs fewer resources gets a higher priority, and a job that needs more resources gets a lower priority.
3. FairScheduler: the fair scheduler. It uses multiple queues, and multiple users share the resources. The priority of a running job can be set from the client, and preemption can also be configured.
The simplest way to use them:
1. To configure FairScheduler:
Modify the configuration file mapred-site.xml, then restart the cluster.
Further settings go in conf/fair-scheduler.xml. The scheduler class it replaces is the default:
<!-- <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value> -->
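As a rough sketch, the mapred-site.xml change for the fair scheduler might look like the fragment below. It assumes a Hadoop 1.x cluster with the fair scheduler jar on the JobTracker classpath; the allocation file path is a placeholder, not a value from these notes.

```xml
<configuration>
  <!-- Replace the default JobQueueTaskScheduler with the fair scheduler -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <!-- Placeholder path to the pool definitions; adjust to your install -->
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/path/to/conf/fair-scheduler.xml</value>
  </property>
</configuration>
```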
2. To configure CapacityScheduler:
Modify the configuration file mapred-site.xml. The capacity scheduler uses multiple queues, which have to be declared explicitly; one queue, named default, exists out of the box.
Further settings go in conf/capacity-scheduler.xml
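A minimal sketch of the corresponding mapred-site.xml entries, assuming Hadoop 1.x and the capacity scheduler jar on the JobTracker classpath. The queue name analytics is hypothetical and only illustrates how an extra queue is declared next to default.

```xml
<configuration>
  <!-- Switch the JobTracker to the capacity scheduler -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
  </property>
  <!-- Declare the queues; "analytics" is a made-up example queue -->
  <property>
    <name>mapred.queue.names</name>
    <value>default,analytics</value>
  </property>
</configuration>
```

Each declared queue then gets its share in conf/capacity-scheduler.xml through a property of the form mapred.capacity-scheduler.queue.<queue-name>.capacity.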

MapReduce job tuning: the settings to focus on are listed in mapred-default.xml. How to tune a job depends on the specific environment.

Tuning can be approached from two sides: the map stage and the reduce stage.
Each small file becomes its own map task, so a large number of small files means a large number of tasks, which wastes resources. SequenceFile can be used to pack them together. Map tasks and reduce tasks are Java processes, so starting a task really means starting a JVM, and that is expensive. To hand many small files to a single map task, you can write a custom CombineFileInputFormat.
Combining small files this way is not especially efficient either: if small files on five different machines are processed by one map task, network transfer is unavoidable and takes time. It should still take less time than running one map task per file. The best approach is not to store large numbers of small files in HDFS in the first place.
Speculative execution: while a job is running, some tasks may be slow. The framework worries that a slow task will hold up the whole job, so it starts a second copy of the same task. Two tasks then run at the same time over the same input; the output of whichever finishes first is used, and the one that has not finished is killed.
Speculative execution should be turned off for the cluster as a whole and turned on only for the individual jobs that need it. This generally saves 5%-10% of cluster resources:
mapred.reduce.tasks.speculative.execution=false
It can be turned on when resources are not tight.
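A sketch of the cluster-wide default in mapred-site.xml, assuming Hadoop 1.x property names. The notes above only mention the reduce side; disabling the map side as well is an assumption made here for symmetry.

```xml
<configuration>
  <!-- Do not launch duplicate attempts of slow map tasks (assumed addition) -->
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <!-- Do not launch duplicate attempts of slow reduce tasks -->
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
```

A job that does need speculation can override these values in its own job configuration, as long as the cluster does not mark the properties final.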
Turn on JVM reuse:
Starting a task process means starting a new JVM. To save resources, enable JVM reuse: the JVM is not shut down when a task ends, and the next map or reduce task is started inside the same JVM.
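As an illustration, JVM reuse is controlled by a single property in Hadoop 1.x. The value -1 below is one possible choice, not a recommendation from these notes.

```xml
<configuration>
  <!-- Tasks of one job that may share a JVM:
       1 = no reuse (default), -1 = no limit -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>
```

A JVM is only reused by tasks of the same job, so the saving shows up mainly in jobs with many short tasks.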
Increase the InputSplit size:
One InputSplit corresponds to one map task, so fewer InputSplits means fewer map tasks and less overhead for starting them. With larger InputSplits there are fewer of them, but each map task processes more data. By the time a map task starts working on its InputSplit, its process has already been started and initialized, and everything after that is reuse of that process. With a larger InputSplit, one map task handles more data for the same startup cost, so throughput goes up.
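One way to get larger splits, sketched under the assumption that the job uses a FileInputFormat-based input format on Hadoop 1.x, is to raise the minimum split size above the HDFS block size. The 256 MB figure is purely illustrative.

```xml
<configuration>
  <!-- Minimum split size in bytes: 256 * 1024 * 1024 = 268435456 -->
  <property>
    <name>mapred.min.split.size</name>
    <value>268435456</value>
  </property>
</configuration>
```

A split larger than one block spans several blocks, so part of its data may be read over the network rather than from the local disk.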
Increase the map output buffer:
A map task produces output that will be sent to the reduce tasks. Until a reduce task fetches it, the map side holds the output in memory, and when it no longer fits in memory it is written to disk. The default buffer of 100 MB is not small, but every write to disk is an I/O operation, and I/O is a heavyweight operation, so it is better to write more map output to disk at a time.
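A hedged sketch of enlarging the buffer on Hadoop 1.x. The buffer lives inside the task JVM heap, so the heap is raised alongside it; both numbers are illustrative assumptions, not tuned values.

```xml
<configuration>
  <!-- In-memory sort buffer for map output, in MB (default 100) -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Task JVM heap must comfortably exceed io.sort.mb -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
```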
Increase the number of spill files merged at once:
Where spills come from: the output of a map task goes into memory first, and when it no longer fits it is written to disk. Writing to disk is called a spill. With a large amount of data the task spills many times, which leaves many small files on disk. The reduce side cannot fetch all those small files directly, so the map side merges them into a single file.
Merging is a disk operation just like spilling. If more files are merged in each pass, fewer passes are needed. Merging on disk consumes memory and CPU, so besides the map operation itself the map side also pays for spills and merges. As many resources as possible should be left for the map function rather than for spilling and merging. These steps cannot be removed, so the only option is to make them use as few resources as possible, that is, to keep their running time short: write more to disk per spill, and merge more files per pass.
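For illustration, the number of files merged in one pass is a single setting in Hadoop 1.x. The value 50 is an arbitrary example; a larger merge width also needs more memory and open file handles.

```xml
<configuration>
  <!-- Number of spill files merged in one pass (default 10) -->
  <property>
    <name>io.sort.factor</name>
    <value>50</value>
  </property>
</configuration>
```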
Compress the map output; the LZO compression algorithm is recommended:
mapred.compress.map.output=true
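A sketch of the full setting, assuming Hadoop 1.x. LZO does not ship with Hadoop, so the codec class below assumes the separate hadoop-lzo library and its native libraries are installed on every node.

```xml
<configuration>
  <!-- Compress intermediate map output before it is shuffled -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- Codec class from the third-party hadoop-lzo package (assumed installed) -->
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>
```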
Increase the number of shuffle copy threads:
Sending map output to the reduce side is called the shuffle. During the shuffle the reduce side initiates the request and the map side serves the data over HTTP. More concurrent copy threads means more map outputs are fetched at the same time, so throughput goes up, the shuffle runs faster, and its running time drops. Higher concurrency also means more network bandwidth is used. That is generally fine because the traffic stays on the internal network, where the extra threads have little impact, so this value is worth raising.
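As an example of the setting involved on Hadoop 1.x; the value 20 is illustrative and should be sized against the network and the number of map tasks.

```xml
<configuration>
  <!-- Parallel fetch threads per reduce task during shuffle (default 5) -->
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
</configuration>
```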
Set the number of map and reduce tasks that run at the same time on a single node; the default is 2.
Raising the value means more tasks run in parallel on one machine, which increases the memory footprint.
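A sketch of the per-node slot settings in mapred-site.xml, assuming Hadoop 1.x TaskTrackers. The values are examples only and should be checked against each node's cores and memory.

```xml
<configuration>
  <!-- Map tasks a single TaskTracker runs concurrently (default 2) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Reduce tasks a single TaskTracker runs concurrently (default 2) -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```

These are read by the TaskTracker at startup, so a change takes effect only after the TaskTracker is restarted.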

