Scheduling algorithms: when MapReduce has many jobs in flight, in what order are they executed?

A scheduling algorithm has to balance two goals:
1. Improve job throughput.
2. Take job priorities into account.
There are three schedulers. Switching between them is worth considering when jobs are not finishing in time while machine utilization stays relatively low.
1. FifoScheduler: the default scheduler. Applications are processed first in, first out. There is a single submission queue, and application priority cannot be configured.
2. CapacityScheduler: the capacity scheduler. Multiple queues. Priority depends on the job: the fewer resources it asks for, the higher its priority; the more it asks for, the lower.
3. FairScheduler: the fair scheduler. Multiple queues, with resources shared among multiple users. Priority can be set from the client at run time, and preemption can be enabled.
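
To make the difference between first in, first out and fair sharing concrete, here is a toy model in Python. It is only a sketch: the job names and slot counts are made up, and the real FairScheduler also deals with pools, weights, minimum shares and preemption, none of which are modeled here.

```python
def fifo_shares(total_slots, demands):
    """Grant slots in submission order until the cluster is full."""
    shares = {}
    remaining = total_slots
    for name, demand in demands.items():  # dicts keep insertion (submission) order
        grant = min(demand, remaining)
        shares[name] = grant
        remaining -= grant
    return shares


def fair_shares(total_slots, demands):
    """Split slots evenly, handing unused share to jobs that still want more."""
    shares = {name: 0 for name in demands}
    remaining = total_slots
    active = {name for name, demand in demands.items() if demand > 0}
    while remaining > 0 and active:
        per_job = max(remaining // len(active), 1)
        for name in sorted(active):
            grant = min(per_job, demands[name] - shares[name], remaining)
            shares[name] += grant
            remaining -= grant
        # Jobs whose demand is met drop out; the rest compete for what is left.
        active = {name for name in active if shares[name] < demands[name]}
    return shares


# Hypothetical workload: three jobs competing for 10 task slots.
demands = {"job_a": 2, "job_b": 8, "job_c": 8}
print(fifo_shares(10, demands))
print(fair_shares(10, demands))
```

With these numbers FIFO gives job_b everything it asks for and leaves job_c with nothing, while the fair split gives job_b and job_c four slots each after job_a's small demand is met.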
The simplest way to use them:
1. Configuring FairScheduler:
Edit the configuration file mapred-site.xml, then restart the cluster.
Further settings go in conf/fair-scheduler.xml.
<property>
<name>mapred.jobtracker.taskScheduler</name>
<!-- <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value> -->
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
2. Configuring CapacityScheduler:
Edit the configuration file mapred-site.xml. The capacity scheduler is multi-queue, so the queue names have to be specified; one queue, named default, exists out of the box.
Further settings go in conf/capacity-scheduler.xml.
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
<name>mapred.queue.names</name>
<value>default</value>
</property>
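
When there is more than one queue, each queue needs a name in mapred.queue.names and a capacity share in conf/capacity-scheduler.xml. The Python sketch below builds those property pairs from a dictionary. It assumes the Hadoop 1.x naming pattern mapred.capacity-scheduler.queue.<queue-name>.capacity for the per-queue share; the queue name etl and the 70/30 split are hypothetical.

```python
def capacity_queue_properties(capacities):
    """Build property name/value pairs for a set of capacity scheduler queues.

    capacities maps queue name -> percentage of cluster capacity.
    """
    total = sum(capacities.values())
    if total > 100:
        raise ValueError(f"queue capacities add up to {total}%, more than 100%")
    # The comma-separated queue list belongs in mapred-site.xml.
    props = {"mapred.queue.names": ",".join(capacities)}
    # The per-queue shares belong in conf/capacity-scheduler.xml (assumed naming).
    for queue, percent in capacities.items():
        props[f"mapred.capacity-scheduler.queue.{queue}.capacity"] = str(percent)
    return props


# Hypothetical layout: keep the default queue and add one queue for ETL jobs.
for name, value in capacity_queue_properties({"default": 70, "etl": 30}).items():
    print(f"{name}={value}")
```

Each printed pair becomes one <property> element with a <name> and a <value>, in the same shape as the entries above.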

MapReduce job tuning: the parameters to focus on are listed in mapred-default.xml. Job tuning always depends on the specific environment.

Think about it in two parts: the map stage and the reduce stage.
One file produces at least one map task, so a large number of small files means a large number of tasks, which wastes resources. Packing the data into a SequenceFile helps. Map tasks and reduce tasks are themselves Java processes, so starting a map or reduce task really means starting a Java process, and that is expensive. To hand many small files to a single map task, write a custom CombineFileInputFormat.
Combining small files is not free either. If the small files sit on five different machines and one map task processes them all, network transfer is unavoidable, and that takes time. It should still take less time than running one map task per file. The best approach is not to put large numbers of small files into HDFS in the first place.
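
A rough count shows how much the small-file problem inflates the number of map tasks. The Python sketch below compares one split per file with a simple greedy packing of whole files into larger splits. The packing is only a stand-in for what a CombineFileInputFormat subclass does: the real class also takes node and rack locality into account, which this sketch ignores. The file sizes are hypothetical.

```python
import math

MB = 1024 * 1024


def map_tasks_per_file(file_sizes, block_size):
    """One split per block, and at least one split for every file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


def map_tasks_combined(file_sizes, max_split_size):
    """Greedily pack whole files into splits of at most max_split_size bytes."""
    splits = 0
    current = 0
    for size in file_sizes:
        if current > 0 and current + size > max_split_size:
            splits += 1  # close the current split and start a new one
            current = 0
        current += size
    return splits + (1 if current > 0 else 0)


# Hypothetical input: 1000 files of 1 MB each, with a 64 MB block size.
small_files = [1 * MB] * 1000
print(map_tasks_per_file(small_files, 64 * MB))   # one map task per file
print(map_tasks_combined(small_files, 64 * MB))   # files packed into 64 MB splits
```

For this input the per-file layout needs 1000 map tasks, while the packed layout needs 16.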
Speculative execution: while a job is running, some tasks may be slow. The framework worries that a slow task will hold up the whole job, so it starts a second copy of the same task. Two tasks then run against the same input at the same time. The result of whichever finishes first is used, and the one that has not finished is killed.
Speculative execution is best turned off for the cluster as a whole and turned on individually for the jobs that need it. This generally saves 5%-10% of cluster resources.
mapred.map.tasks.speculative.execution=true;
mapred.reduce.tasks.speculative.execution=false;
It can be turned on when resources are not tight.
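
The trade-off behind speculative execution can be put into numbers. The Python sketch below is a simplified model, not the framework's actual logic: it assumes a single backup attempt with a known start time and duration, and it counts the slot time used by both attempts. All durations are hypothetical.

```python
def run_with_speculation(original, backup_start, backup_duration):
    """Return (finish_time, slot_seconds) for a task with one backup attempt.

    original        -- seconds the first attempt needs on its own
    backup_start    -- seconds after task start at which the backup is launched
    backup_duration -- seconds the backup attempt needs
    """
    if backup_start >= original:
        # The first attempt finished before a backup was ever launched.
        return original, original
    # Whichever attempt ends first wins; the other one is killed at that moment.
    finish = min(original, backup_start + backup_duration)
    slot_seconds = finish + (finish - backup_start)
    return finish, slot_seconds


# A real straggler: the backup wins, and the task ends at 160 s instead of 300 s.
print(run_with_speculation(original=300, backup_start=60, backup_duration=100))
# A false alarm: the original wins anyway, and the backup's 60 s are wasted.
print(run_with_speculation(original=120, backup_start=60, backup_duration=100))
```

The second case is the cost being saved when speculation is switched off: the task finishes at the same time either way, but the backup occupied a slot for nothing.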
Turn on JVM reuse:
Starting a task process means starting a new JVM. To save resources, turn on JVM reuse: the JVM is not shut down after a task finishes, and the next map or reduce task is started inside it. A value of -1 puts no limit on how many tasks of the same job one JVM may run.
mapred.job.reuse.jvm.num.tasks=-1;
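
How many JVM starts does reuse actually save? The Python sketch below is a back-of-the-envelope model. It assumes tasks are spread evenly over the available slots and that a JVM is only reused within one slot; the task and slot counts are hypothetical.

```python
import math


def jvms_launched(num_tasks, slots, reuse_limit):
    """Estimate JVM starts for num_tasks tasks spread evenly over slots.

    reuse_limit is the number of tasks one JVM may run; -1 means no limit.
    """
    base, extra = divmod(num_tasks, slots)
    tasks_per_slot = [base + 1] * extra + [base] * (slots - extra)
    if reuse_limit == -1:
        # One long-lived JVM per slot that receives any work at all.
        return sum(1 for tasks in tasks_per_slot if tasks > 0)
    return sum(math.ceil(tasks / reuse_limit) for tasks in tasks_per_slot)


# Hypothetical job: 100 tasks on 4 slots.
print(jvms_launched(100, 4, 1))    # no reuse: one JVM per task
print(jvms_launched(100, 4, 10))   # each JVM runs up to 10 tasks
print(jvms_launched(100, 4, -1))   # unlimited reuse
```

In this model the job starts 100 JVMs without reuse, 12 with a limit of 10, and 4 with unlimited reuse.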
Increase the InputSplit size:
mapred.min.split.size=268435456
Each InputSplit corresponds to one map task, so fewer InputSplits means fewer map tasks and less resource spent on running them. The value above is 256 MB. With larger InputSplits the number of splits goes down while the amount of data each map task handles goes up. By the time a map task is reading its InputSplit, its process has already been started and initialized, and everything after that reuses the same process. So a larger InputSplit lets one map task process more data for the same startup cost, and throughput goes up.
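
The effect of the minimum split size on the number of map tasks can be estimated with a few lines of Python. The sketch assumes the usual FileInputFormat rule, split size = max(min size, min(max size, block size)), and ignores the small slack Hadoop allows on the last split of a file, so treat the counts as approximate. The 10 GB file and 64 MB block size are hypothetical.

```python
import math

MB = 1024 * 1024
NO_MAX = 2**63 - 1  # stands in for "no maximum split size configured"


def compute_split_size(block_size, min_size, max_size=NO_MAX):
    """Split size under the assumed rule max(min, min(max, block))."""
    return max(min_size, min(max_size, block_size))


def estimated_map_tasks(file_size, split_size):
    """Approximate number of splits, and so map tasks, for one file."""
    return math.ceil(file_size / split_size)


file_size = 10 * 1024 * MB   # hypothetical 10 GB input file
block_size = 64 * MB         # hypothetical HDFS block size

default_split = compute_split_size(block_size, min_size=1)
tuned_split = compute_split_size(block_size, min_size=268435456)
print(estimated_map_tasks(file_size, default_split))  # splits follow the block size
print(estimated_map_tasks(file_size, tuned_split))    # splits follow the 256 MB minimum
```

Here the estimate drops from 160 map tasks to 40 once the minimum split size is raised to 256 MB.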
Increase the map output buffer:
io.sort.mb=300
A map task produces output that is sent to reduce. Until reduce fetches it, the map side holds the output in memory, and when it no longer fits in memory it is written to disk. The default buffer is 100 MB. Every write to disk is an I/O operation, and I/O is a heavyweight operation, so it is better to write more map output to disk at a time.
Increase the number of spill files merged at once:
Where spills come from: map output is first held in memory. When memory fills up, the output is written to disk, and that write is called a spill. With a large amount of data there are many spills, which leaves many small files on disk. Reduce cannot fetch the output in that form, so the map side merges the small files on disk into a single file. That merge is disk work, just like the spill itself. If more files are merged in one pass, fewer passes are needed. Merging costs memory and CPU as well as disk. So the map side carries spill and merge operations on top of the map function itself, and resources should go to the map function rather than to spilling and merging. These steps cannot be skipped, so the only option is to make them use as few resources as possible, that is, to keep their running time as short as possible: write more to disk per spill, and merge more files per pass.
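
The link between buffer size, spill count and merge passes can be estimated roughly. The Python sketch below assumes that a spill is triggered when the buffer is 80% full (the Hadoop 1.x default for io.sort.spill.percent, as far as this sketch assumes) and that the merge width is the one controlled by io.sort.factor, whose default is 10. It ignores the buffer space used for record metadata and the finer details of how Hadoop sizes each merge pass, and the 1000 MB of map output is hypothetical.

```python
import math


def spill_count(map_output_mb, io_sort_mb, spill_percent=0.8):
    """Estimate how many spill files one map task writes."""
    return max(1, math.ceil(map_output_mb / (io_sort_mb * spill_percent)))


def merge_passes(num_files, merge_factor):
    """Estimate merge passes needed to reduce num_files to a single file."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    passes = 0
    while num_files > 1:
        num_files = math.ceil(num_files / merge_factor)
        passes += 1
    return passes


# Hypothetical map task producing 1000 MB of output.
for buffer_mb in (100, 300):
    spills = spill_count(1000, buffer_mb)
    print(buffer_mb, spills, merge_passes(spills, merge_factor=10))
```

In this model the 100 MB buffer yields 13 spills and 2 merge passes, while the 300 MB buffer yields 5 spills and a single pass.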
Compress map output; the LZO compression algorithm is recommended:
mapred.compress.map.output=true;
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
Increase the number of shuffle copy threads:
mapred.reduce.parallel.copies=15;
Sending map output to the reduce side is called the shuffle. During the shuffle the reduce side initiates the request, and the map side hands the data over via HTTP. More concurrent copy threads means more map outputs are copied at the same time, so throughput goes up, the shuffle runs faster, and its running time shrinks. Higher concurrency also means more network bandwidth is used. In practice the network copes, because this is intranet traffic and a few more concurrent threads have little impact, so this value is worth raising.
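
A quick estimate shows why more copy threads shorten the shuffle. The Python sketch below assumes every map output takes the same time to copy and that the network is not the bottleneck, which is the situation described above. The number of map outputs is hypothetical, and 5 is used as the baseline because it is the Hadoop 1.x default for the parallel copies setting.

```python
import math


def copy_rounds(num_map_outputs, parallel_copies):
    """Rounds of fetching a reducer needs when copying in fixed-size batches."""
    if parallel_copies < 1:
        raise ValueError("parallel_copies must be at least 1")
    return math.ceil(num_map_outputs / parallel_copies)


# Hypothetical job with 300 map outputs for each reducer to fetch.
print(copy_rounds(300, 5))    # baseline of 5 copy threads
print(copy_rounds(300, 15))   # the value suggested above
```

With 300 map outputs the reducer needs 60 rounds at 5 threads and 20 rounds at 15 threads.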
Set the number of map and reduce tasks that can run at the same time on a single node; the default is 2 for each.
Raising these values lets one machine run more tasks in parallel, at the cost of a larger memory footprint.
mapred.tasktracker.map.tasks.maximum=2;
mapred.tasktracker.reduce.tasks.maximum=2;
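
Before raising the slot counts it helps to check what they cost in memory. The Python sketch below multiplies the number of slots by the heap size of each task JVM. It assumes every slot is busy and counts only the task heap, not the TaskTracker, DataNode or operating system. The 200 MB figure is the Hadoop 1.x default child heap (-Xmx200m); the larger configuration is hypothetical.

```python
def node_task_memory_mb(map_slots, reduce_slots, child_heap_mb):
    """Heap needed on one node when every map and reduce slot is in use."""
    return (map_slots + reduce_slots) * child_heap_mb


def cluster_concurrent_tasks(nodes, map_slots, reduce_slots):
    """Maximum number of tasks the whole cluster can run at the same time."""
    return nodes * (map_slots + reduce_slots)


# Defaults: 2 map slots and 2 reduce slots, 200 MB heap per task.
print(node_task_memory_mb(2, 2, 200))
# Hypothetical tuned node: 8 map slots and 4 reduce slots, 1024 MB heap per task.
print(node_task_memory_mb(8, 4, 1024))
# Hypothetical 10-node cluster with the tuned slot counts.
print(cluster_concurrent_tasks(10, 8, 4))
```

The defaults need 800 MB of task heap per node; the tuned node needs 12288 MB, so the slot counts should be raised only as far as the machine's memory allows.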
