Loadavg analysis

A brief introduction to loadavg

Running cat /proc/loadavg shows the current system load:
$ cat /proc/loadavg 
0.01 0.02 0.05 2/317 26207 
The first three values are the average load over the last 1, 5 and 15 minutes. The load reflects how busy the system currently is. On a 16-core system, if every core's CPU utilization is 30% and there are no uninterruptible processes, the load should stay around 4.8. On the same 16-core system, if the load stays around 16 and there are no uninterruptible processes, the CPUs have almost no idle time and utilization is close to 100%. Combining iowait, vmstat and loadavg, we can analyze the overall load of the system and how that load is distributed among its parts.

Reading loadavg

In the kernel, /proc/loadavg is produced by loadavg_read_proc. Let's look at its implementation first:

fs/proc/proc_misc.c
static int loadavg_read_proc(char *page, char **start, off_t off,
                             int count, int *eof, void *data)
{
        int a, b, c;
        int len;

        a = avenrun[0] + (FIXED_1/200);
        b = avenrun[1] + (FIXED_1/200);
        c = avenrun[2] + (FIXED_1/200);
        len = sprintf(page,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n",
                LOAD_INT(a), LOAD_FRAC(a),
                LOAD_INT(b), LOAD_FRAC(b),
                LOAD_INT(c), LOAD_FRAC(c),
                nr_running(), nr_threads, last_pid);
        return proc_calc_metrics(page, start, off, count, eof, len);
}

The relevant macros are defined as follows:

#define FSHIFT 11 /* nr of bits of precision */
#define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */
#define LOAD_INT(x) ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)

Judging from the output format, LOAD_INT computes the integer part of the load and LOAD_FRAC computes the fractional part.
Substituting a = avenrun[0] + (FIXED_1/200) into the integer part and the fractional part gives:

LOAD_INT(a)  = avenrun[0]/(2^11) + 1/200
LOAD_FRAC(a) = ((avenrun[0]%(2^11) + 2^11/200) * 100) / (2^11)
             = ((avenrun[0]%(2^11)) * 100 + 2^11/2) / (2^11)
             = ((avenrun[0]%(2^11)) * 100) / (2^11) + 1/2

This shows that the FIXED_1/200 term rounds the third decimal place: since only the first two decimal places are printed, a third decimal of 5 or more rounds the result up, otherwise it is simply dropped.

In the temporary variables a/b/c, the low 11 bits hold the fractional part of the load and the bits from bit 11 upward hold the integer part, so a = load(1min) * 2^11.
Therefore: load(1min) * 2^11 = avenrun[0] + 2^11/200
which gives: load(1min) = avenrun[0]/(2^11) + 1/200
Ignoring the 1/200 term used only to round the third decimal place, load(1min) = avenrun[0] / 2^11, i.e.:
avenrun[0] = load(1min) * 2^11
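
To make the fixed-point arithmetic concrete, here is a minimal user-space sketch (not kernel code; the sample value 2355 is made up) that formats an avenrun-style value the same way loadavg_read_proc does:

#include <stdio.h>

#define FSHIFT   11
#define FIXED_1  (1 << FSHIFT)
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

int main(void)
{
        unsigned long avenrun0 = 2355;                 /* assumed: ~1.15 * 2^11  */
        unsigned long a = avenrun0 + (FIXED_1 / 200);  /* round the 3rd decimal  */

        printf("%lu.%02lu\n", LOAD_INT(a), LOAD_FRAC(a)); /* prints 1.15 */
        return 0;
}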

So what exactly is avenrun? How is this variable computed, and how is it related to the running processes and the CPUs of the system? That is the subject of the second stage of the analysis.

Loadavg and processes

The kernel decouples the calculation of the load from the reading of the load; avenrun is the bridge that connects the two.
Starting from avenrun, let's analyze how the system load is actually computed.
The avenrun array is updated in calc_load:

kernel/timer.c
/*
 * calc_load - given tick count, update the avenrun load estimates.
 * This is called while holding a write_lock on xtime_lock.
 */
static inline void calc_load(unsigned long ticks)
{
        unsigned long active_tasks; /* fixed-point */
        static int count = LOAD_FREQ;

        count -= ticks;
        if (count < 0) {
                count += LOAD_FREQ;
                active_tasks = count_active_tasks();
                CALC_LOAD(avenrun[0], EXP_1, active_tasks);
                CALC_LOAD(avenrun[1], EXP_5, active_tasks);
                CALC_LOAD(avenrun[2], EXP_15, active_tasks);
        }
}

static unsigned long count_active_tasks(void)
{
        return nr_active() * FIXED_1;
}

#define LOAD_FREQ (5*HZ) /* 5 sec intervals */
#define EXP_1 1884       /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5 2014       /* 1/exp(5sec/5min) */
#define EXP_15 2037      /* 1/exp(5sec/15min) */

calc_load is executed on every tick, but avenrun is updated only once per LOAD_FREQ (5s) period.
active_tasks is the number of tasks contributing to the current system load, nr_active, multiplied by FIXED_1, and is used to compute avenrun. The macro CALC_LOAD is defined as follows:

#define CALC_LOAD(load,exp,n) \
        load *= exp; \
        load += n*(FIXED_1-exp); \
        load >>= FSHIFT;

Writing avenrun(t-1) for the previously computed avenrun and avenrun(t) for the one computed now, the CALC_LOAD macro gives:

avenrun(t) = (avenrun(t-1) * EXP_N + nr_active * FIXED_1 * (FIXED_1 - EXP_N)) / FIXED_1
           = avenrun(t-1) + (nr_active * FIXED_1 - avenrun(t-1)) * (FIXED_1 - EXP_N) / FIXED_1

Rearranging gives:

avenrun(t) - avenrun(t-1) = (nr_active * FIXED_1 - avenrun(t-1)) * (FIXED_1 - EXP_N) / FIXED_1

Substituting the result derived in the first stage (avenrun = load * FIXED_1) into the formula above gives:

(load(t) - load(t-1)) * FIXED_1 = (nr_active - load(t-1)) * (FIXED_1 - EXP_N)

This gives the relation between changes in nr_active and changes in load:

load(t) - load(t-1) = (nr_active - load(t-1)) * (FIXED_1 - EXP_N) / FIXED_1

This formula shows two things:
1) When nr_active stays constant, load keeps approaching nr_active, and the rate of approach slows down as the gap shrinks.
2) Changes in nr_active show up in load in an attenuated form: if the system suddenly gains 10 more processes, the 1-minute load rises by less than 1 at each update (this is the weighting at work).

The formula can also be rearranged as:

load(t) = load(t-1) * EXP_N / FIXED_1 + nr_active * (1 - EXP_N/FIXED_1)

which shows more directly how the historical load and nr_active are weighted in the current load (thanks to Ren Zhenyu for pointing this out).

#define EXP_1 1884 /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5 2014 /* 1/exp(5sec/5min) */
#define EXP_15 2037 /* 1/exp(5sec/15min) */
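
The comments above state that EXP_N is 1/exp(5sec/Nmin) in fixed point; a small user-space check (just a sketch, compile with -lm) reproduces the three constants:

#include <math.h>
#include <stdio.h>

int main(void)
{
        double fixed_1 = 2048.0;                                  /* FIXED_1 = 2^11 */

        printf("EXP_1  ~= %.0f\n", fixed_1 * exp(-5.0 / 60.0));   /* ~1884 */
        printf("EXP_5  ~= %.0f\n", fixed_1 * exp(-5.0 / 300.0));  /* ~2014 */
        printf("EXP_15 ~= %.0f\n", fixed_1 * exp(-5.0 / 900.0));  /* ~2037 */
        return 0;
}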

The EXP_N values for 1, 5 and 15 minutes are given in the defines above. The larger EXP_N is, the smaller (FIXED_1 - EXP_N)/FIXED_1 becomes, and the less a change in nr_active affects the load. For a system whose nr_active fluctuates little, load keeps approaching nr_active: the approach is fast at first, slows down as the difference shrinks, gets ever slower the closer it comes, and finally settles at nr_active. As illustrated below:
(figure: load 1515.jpg — load approaching nr_active; image not available)
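
The convergence described above can be reproduced with a short user-space simulation; this is only a sketch that assumes a constant nr_active of 4 and applies the same CALC_LOAD update in a loop:

#include <stdio.h>

#define FSHIFT   11
#define FIXED_1  (1 << FSHIFT)
#define EXP_1    1884                     /* 1/exp(5sec/1min) as fixed-point */

#define CALC_LOAD(load, exp, n) \
        load *= exp;            \
        load += n * (FIXED_1 - exp); \
        load >>= FSHIFT;

int main(void)
{
        unsigned long avenrun = 0;            /* start from an idle system   */
        unsigned long nr_active = 4;          /* assumed constant task count */
        unsigned long n = nr_active * FIXED_1;
        int i;

        for (i = 1; i <= 24; i++) {           /* 24 updates = 2 minutes      */
                CALC_LOAD(avenrun, EXP_1, n);
                if (i % 6 == 0)               /* print every 30 seconds      */
                        printf("t=%3ds load=%lu.%02lu\n", i * 5,
                               avenrun >> FSHIFT,
                               ((avenrun & (FIXED_1 - 1)) * 100) >> FSHIFT);
        }
        return 0;
}

The printed load climbs toward 4, gaining less at each step, which is exactly the weighted-average behavior derived above.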

So the conclusion is that load directly reflects nr_active of the system. What does nr_active include, and how is it computed
for the current system? That brings us to the sampling of nr_active.

Loadavg sampling

nr_active directly reflects the total number of processes contributing to the system load. This total is computed in the function nr_active:

kernel/sched.c
unsigned long nr_active(void)
{
        unsigned long i, running = 0, uninterruptible = 0;

        for_each_online_cpu(i) {
                running += cpu_rq(i)->nr_running;
                uninterruptible += cpu_rq(i)->nr_uninterruptible;
        }

        if (unlikely((long)uninterruptible < 0))
                uninterruptible = 0;

        return running + uninterruptible;
}

#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_STOPPED 4
#define TASK_TRACED 8
/* in tsk->exit_state */
#define EXIT_ZOMBIE 16
#define EXIT_DEAD 32
/* in tsk->state again */
#define TASK_NONINTERACTIVE 64

This function shows that two kinds of processes contribute to the system load: those in TASK_RUNNING and those in TASK_UNINTERRUPTIBLE.
When the 5s sampling period arrives, the run queue of every online CPU is traversed, the number of running plus uninterruptible processes
on that queue at that moment is taken as that CPU's load, and the sum over all CPUs is the nr_active of this sample.

The following example illustrates how loadavg is computed on a 2.6.18 kernel:

loadavg computation on the 2.6.18 kernel:

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load
0HZ+10             1     1     1     0     0     0     0     0         0
5HZ                0     0     0     0     1     1     1     1         4
5HZ+1              0     1     1     1     0     0     0     0         0
5HZ+9              0     0     0     0     0     1     1     1         0
5HZ+11             1     1     1     0     0     0     0     0         0

Problems with the 2.6.18 kernel's loadavg computation

xtime_lock analysis

The kernel updates the load once every 5s, and all of this happens in calc_load. Tracing how calc_load gets called:

kernel/timer.c
static inline void update_times(void)
{
        unsigned long ticks;

        ticks = jiffies - wall_jiffies;
        wall_jiffies += ticks;
        update_wall_time();
        calc_load(ticks);
}

update_times updates the system wall time and then performs the global load update.

kernel/timer.c
void do_timer(struct pt_regs *regs)
{
        jiffies_64++;
        /* prevent loading jiffies before storing new jiffies_64 value. */
        barrier();
        update_times();
}

do_timer first updates the global jiffies clock and then calls update_times.

void main_timer_handler(struct pt_regs *regs)
{
        ...
        write_seqlock(&xtime_lock);
        ...
        do_timer(regs);
#ifndef CONFIG_SMP
        update_process_times(user_mode(regs));
#endif
        ...
        write_sequnlock(&xtime_lock);
}

Both the update of wall_time and the update of the global jiffies are performed while holding the sequence lock xtime_lock.

include/linux/seqlock.h
static inline void write_seqlock(seqlock_t *sl)
{
        spin_lock(&sl->lock);
        ++sl->sequence;
        smp_wmb();
}

static inline void write_sequnlock(seqlock_t *sl)
{
        smp_wmb();
        sl->sequence++;
        spin_unlock(&sl->lock);
}

typedef struct {
        unsigned sequence;
        spinlock_t lock;
} seqlock_t;

A sequence lock protects its data with an internal counter, sequence. The write side is built on spin_lock:
after taking the spinlock the sequence counter is incremented, and it is incremented again just before the lock is released.
sequence is initialized to 0, so whenever it is odd the write side of the sequence lock is held and a concurrent read may be unsafe;
a reader that races with a writer has to retry until the write completes. The corresponding read side is used like this:

#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void)
{
        unsigned long seq;
        u64 ret;

        do {
                seq = read_seqbegin(&xtime_lock);
                ret = jiffies_64;
        } while (read_seqretry(&xtime_lock, seq));
        return ret;
}
EXPORT_SYMBOL(get_jiffies_64);
#endif

The read side is implemented as follows:

static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl->sequence;
        smp_rmb();
        return ret;
}

static __always_inline int read_seqretry(const seqlock_t *sl, unsigned iv)
{
        smp_rmb();
        /*
         * iv is the sequence counter sampled before the read.
         * If iv is odd, a writer held the lock while we were reading and we
         * may have read a bad value.
         * If iv is even but the counter now differs from iv, a writer came in
         * during the read and we may also have read a bad value.
         */
        return (iv & 1) | (sl->sequence ^ iv);
}

This completes the analysis of xtime_lock. Since its write side is built on spin_lock, writers competing for the lock spin and wait,
and if the work done while holding the lock takes too long, latencies across the whole system grow. In addition, if there are many
readers of xtime_lock in the system, then once some thread takes the write lock the readers fall into a spin_lock-like polling loop
until they can read a consistent value. So the work done between taking and releasing the xtime_lock write lock must be kept as short as possible.
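
For illustration, here is a minimal user-space sketch of the same read/write protocol (assuming C11 atomics and pthreads; it is not the kernel implementation): the writer serializes on a mutex and bumps the counter around the update, the reader retries while the counter is odd or has changed.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t wlock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint sequence;             /* odd => a writer is in progress */
static unsigned long long counter;       /* the data the seqlock protects  */

static void write_update(void)
{
        pthread_mutex_lock(&wlock);       /* writers exclude each other     */
        atomic_fetch_add(&sequence, 1);   /* sequence becomes odd           */
        counter++;                        /* the protected update           */
        atomic_fetch_add(&sequence, 1);   /* sequence becomes even again    */
        pthread_mutex_unlock(&wlock);
}

static unsigned long long read_value(void)
{
        unsigned seq;
        unsigned long long val;

        do {                              /* retry if a write was in flight */
                seq = atomic_load(&sequence);
                val = counter;
        } while ((seq & 1) || atomic_load(&sequence) != seq);
        return val;
}

int main(void)
{
        write_update();
        printf("%llu\n", read_value());   /* prints 1 */
        return 0;
}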

Separating reads and writes of the global load to solve the xtime_lock problem

calc_load, which computes the global load, has to walk the run queue of every CPU once every 5s to collect each CPU's load. 1) Since the
number of CPUs is not fixed, the execution time of calc_load is not fixed either, and on machines with many cores xtime_lock ends up
being held for too long. 2) calc_load is only a sampling procedure run every 5s and can never be perfectly accurate anyway, so there is
no need for strict locking between updating and reading avenrun; the update and the read of the global load can be decoupled.
Dimitri Sivanich reported that on their large SMP systems the traversal of all online CPUs in calc_load caused large system latencies.
For these reasons Thomas Gleixner submitted the following patches to fix the problem:

[Patch 1/2] sched, timers: move calc_load() to scheduler
[Patch 2/2] sched, timers: cleanup avenrun users

(figure: rw isolate.jpg — read/write separation of the global load; image not available)

The main idea of Thomas's two patches is shown in the figure above. The global load calculation is pushed down to each CPU: a CPU
computes its own load without taking xtime_lock and folds the result into the global calc_load_tasks; once every CPU has done so,
calc_load_tasks is the load of the whole system. When the 5s timer expires, calc_global_load runs, reads the global calc_load_tasks
and updates avenrun. Since this is nothing more than reading calc_load_tasks, its execution time no longer depends on the number of CPUs.
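
The following is a schematic user-space sketch of that idea (the names ending in _sketch are hypothetical, not the kernel symbols): each CPU folds only the change in its active task count into one shared counter, and the global update just reads that counter.

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 16

static atomic_long calc_load_tasks;              /* global sum of per-CPU deltas  */
static long per_cpu_calc_load_active[NR_CPUS];   /* last value each CPU folded in */

/* runs on each CPU from its own tick, once per LOAD_FREQ period, no global lock */
static void calc_load_account_active_sketch(int cpu, long nr_active)
{
        long delta = nr_active - per_cpu_calc_load_active[cpu];

        if (delta) {
                atomic_fetch_add(&calc_load_tasks, delta);
                per_cpu_calc_load_active[cpu] = nr_active;
        }
}

/* runs once per LOAD_FREQ period (10 ticks later) to update avenrun */
static long calc_global_load_sketch(void)
{
        /* a single read, independent of the number of CPUs */
        return atomic_load(&calc_load_tasks);
}

int main(void)
{
        calc_load_account_active_sketch(0, 2);       /* cpu0 reports 2 active tasks */
        calc_load_account_active_sketch(1, 1);       /* cpu1 reports 1 active task  */
        printf("%ld\n", calc_global_load_sketch());  /* prints 3 */
        return 0;
}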

A few key points:

Per-CPU load calculation without xtime_lock

If xtime_lock is not taken, how do we guarantee that when avenrun is updated, the calc_load_tasks being read already contains the load computed by every CPU?

Thomas's solution

Thomas's approach is to drive the sampling from sched_tick: every CPU arms its own LOAD_FREQ deadline and computes its load on the
local processor when that period expires. sched_tick runs once on every tick, and tick delivery is driven by hardware, so it is not
affected by how busy the system is.

The timing of sched_tick

Moving the per-CPU load calculation into sched_tick raises an obvious question: isn't this still inside the timer interrupt path, so
doesn't the xtime_lock problem remain? Let's look at sched_tick (the following analysis is based on the linux-2.6.32-220.17.1.el5 source):

static void update_cpu_load_active(struct rq *this_rq)
{
        update_cpu_load(this_rq);
        calc_load_account_active(this_rq);
}

void scheduler_tick(void)
{
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);
        ...
        spin_lock(&rq->lock);
        ...
        update_cpu_load_active(rq);
        ...
        spin_unlock(&rq->lock);
        ...
}

void update_process_times(int user_tick)
{
        ...
        scheduler_tick();
        ...
}

static void tick_periodic(int cpu)
{
        if (tick_do_timer_cpu == cpu) {
                write_seqlock(&xtime_lock);
                /* Keep track of the next tick event */
                tick_next_period = ktime_add(tick_next_period, tick_period);
                do_timer(1); /* calc_global_load is called from do_timer */
                write_sequnlock(&xtime_lock);
        }
        update_process_times(user_mode(get_irq_regs()));
        ...
}

void tick_handle_periodic(struct clock_event_device *dev)
{
        int cpu = smp_processor_id();
        ...
        tick_periodic(cpu);
        ...
}

As the call chain shows, only the CPU designated as tick_do_timer_cpu takes xtime_lock, and only around do_timer (which is where
calc_global_load is invoked). The per-CPU load accounting in update_cpu_load_active runs from scheduler_tick under that CPU's
rq->lock, outside xtime_lock, so it no longer adds to the xtime_lock hold time.

Staggering the sample times

With the per-CPU load calculation in sched_tick, there is still the question of when each per-CPU calculation runs: how do we guarantee
that when avenrun is updated, the global load read at that moment already includes every CPU's calculation? The current approach gives
all CPUs the same LOAD_FREQ deadline; on the first tick a CPU sees after the deadline it computes its load and folds it into the global
calc_load_tasks. calc_global_load runs at LOAD_FREQ+10, i.e. 10 ticks after all the per-CPU calculations have finished, and reads the
global calc_load_tasks to update avenrun.

loadavg computation on the 2.6.32 kernel:

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load_tasks
0HZ+10             0     0     0     0     0     0     0     0         0
5HZ                1     1     1     1     1     1     1     1         0
5HZ+1              0     1     1     1     0     0     0     0         0
                        +1    +1    +1                                 1+1+1=3
5HZ+11             0     1     1     1     0     0     0     0         3
calc_global_load  <----------------------------------------------      3

By staggering calc_global_load in time from the per-CPU load calculations, we avoid calc_global_load running in the middle of them
and producing an inaccurate load sample.

The nohz problem of the 2.6.32 load accounting

Solving one problem often gives birth to countless new ones! The per-CPU load calculation neatly separates updating the global load
from reading it and avoids the xtime_lock problem on machines with many CPUs, but it also brings new problems of its own; the main one is nohz.

To avoid a flood of pointless clock interrupts while a CPU is idle, the nohz technique was introduced: once a CPU enters the idle state
its clock interrupt is stopped until the next timer expires, or until the tick is re-enabled because the CPU needs to reschedule.

After a CPU enters nohz its tick stops, so sched_tick no longer runs on every tick, and the per-CPU load calculation hooked into
sched_tick is no longer guaranteed to run once every LOAD_FREQ. If a CPU happens to be in nohz when its calculation is due, that CPU's
sched_tick is missed, that load update is lost, and the final global load is inaccurate.
Following the idea of Thomas's first patch, the nohz case can be handled when the CPU schedules idle: just before the CPU enters idle,
its load is updated once more, so even if it then enters nohz its load is already up to date and nothing is missed. As shown below:

loadavg computation on the 2.6.32 kernel (load folded in on idle entry):

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load_tasks
0HZ+11             1     1     1     0     0     0     0     0         3
5HZ                0     0     0     0     3     2     1     3         0
                  -1    -1    -1                                       3-3=0
5HZ+1              0     1     1     1     1     1     1     1         1
                        +1    +1    +1    +1    +1    +1    +1         0+1+...+1=7
5HZ+11             0     1     1     1     1     1     1     1         7
calc_global_load  <----------------------------------------------      7

In theory this nicely solves the inaccuracy caused by nohz; in practice it planted the seed of a bitter problem. Many production
applications reported that the load on the new kernel was wrong: on 16-core machines with average CPU utilization of 20%~30%, the overall load stayed below 1.

Solution

After receiving the reports of abnormally low load from production we investigated. The first suspect was a race on the global load
count. On a 16-core system that never enters nohz, all 16 CPUs perform their per-CPU load calculation in the first tick after the
LOAD_FREQ deadline and fold the result into the global load; if those updates raced with each other, the global load would come out
wrong. However, each CPU's rq remembers the load value it folded in last time: if the newly computed load differs from the remembered
value by 0, the global load is left untouched, otherwise only the difference is added. Because of this mechanism, if races really were
corrupting the global load, then with every CPU maintaining its own last value the global load would sooner or later go negative.
No negative value was ever observed in production, so the race theory was ruled out.

Analyzing the scheduling information from /proc/sched_debug showed that at any moment only about 2~3 processes were running, and the
CPUs they ran on kept changing. Further analysis showed each CPU averaged about 1000 sched_goidle events per second, i.e. it entered
idle roughly once every 1ms. With HZ=1000, 1s = 1000 ticks, so 1 tick = 1ms: on this workload almost every tick contains a switch to
idle! The discovery felt like staring at a drop of water with the naked eye, seeing it perfect and pure, and then having a magnifying
glass put in front of you: all kinds of debris appear. In the old picture, 10 ticks was so short that a process might not even finish
its work inside it; now it turns out that within 10 ticks the scheduler may switch to idle close to 10 times. Let's work through the
scenario with an example:

loadavg computation on the 2.6.32 kernel:

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load_tasks
0HZ+11             1     1     1     0     0     0     0     0         3
5HZ                0     0     0     1     1     1     0     0
                  -1    -1    -1                                       3-3=0
5HZ+1              1     0     0     0     0     0     1     1
                  +1                                  +1    +1         0+1+1+1=3
5HZ+3              0     1     1     1     0     0     0     0         3
                  -1                                  -1    -1         3-1-1-1=0
5HZ+5              0     0     0     0     1     1     1     0         0
5HZ+11             1     0     0     0     0     0     1     1         0
calc_global_load  <----------------------------------------------      0

(Note: you may have noticed that between 5HZ+5 and 5HZ+11 some CPUs also go from non-idle to idle without contributing a -1. This is
because each CPU remembers the load it computed last time and skips the update when nothing changed; those CPUs had last computed a
load of 0, so there is no change to fold in.)

Orz! A load of 3 is computed as 0 outright; no wonder the overall system load was too low. The key point: for a CPU that has already
computed its load we recompute it again on entering idle, but we never considered the unfairness this creates when the CPU later goes
from idle back to non-idle. That is the problem in the 2.6.32 kernels currently in production. After locating it we checked upstream
and found that Peter Zijlstra had already submitted three patches for this load accounting problem, the latest in April:

[Patch] sched: Cure load average vs NO_HZ woes
[Patch] sched: Cure more NO_HZ load average woes
[Patch] sched: Fix nohz load accounting – again!

These are the patches we have backported so far. Their basic idea is to record the load change caused by entering idle temporarily,
instead of letting every idle entry update the global load immediately. The tricky part is deciding when the parked idle contribution
gets folded into the global load: at the start of the per-CPU load calculations all the idle contribution accumulated so far has to be
counted in, and since the order in which the CPUs run their calculation is not determined, this fold is performed on every CPU. An example:

loadavg computation on the 2.6.32 kernel (idle deltas parked in tasks_idle):

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load_tasks   tasks_idle
0HZ+11             1     1     1     0     0     0     0     0         3                 0
5HZ                0     0     0     1     1     1     0     0
                  -1    -1    -1                                       3                -3
5HZ+1              1     0     0     0     0     0     1     1         3
                  +1                                  +1    +1         3-3+1+1+1=3       0
5HZ+3              0     1     1     1     0     0     0     0         3
                  -1                                  -1    -1         3                -1-1-1=-3
5HZ+5              0     0     0     0     1     1     1     0         3
5HZ+11             1     0     0     0     0     0     1     1         3
calc_global_load  <----------------------------------------------      3                -3

With these three patches the idle-entry problem described above is handled well.
After tidying up the three patches we tested them on a Taoke front-end production machine, and the load clearly improved.
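
Schematically, the idea of the three patches can be sketched as follows (hypothetical names, not the real kernel symbols): a delta caused by entering idle is parked in a separate counter and only folded into the global count when a CPU performs its periodic calculation.

#include <stdatomic.h>

static atomic_long calc_load_tasks;          /* global active-task count       */
static atomic_long calc_load_tasks_idle;     /* deltas parked on idle entry    */
static long per_cpu_calc_load_active[16];    /* last value each CPU folded in  */

/* called when a CPU is about to enter nohz idle: park the delta */
static void account_idle_entry_sketch(int cpu)
{
        long delta = 0 - per_cpu_calc_load_active[cpu];

        if (delta) {
                atomic_fetch_add(&calc_load_tasks_idle, delta);
                per_cpu_calc_load_active[cpu] = 0;
        }
}

/* called at the start of each CPU's periodic calculation: fold parked deltas */
static void fold_idle_into_global_sketch(void)
{
        long idle = atomic_exchange(&calc_load_tasks_idle, 0);

        if (idle)
                atomic_fetch_add(&calc_load_tasks, idle);
}

int main(void)
{
        per_cpu_calc_load_active[3] = 1;     /* pretend cpu3 had folded in 1 task */
        account_idle_entry_sketch(3);        /* cpu3 goes idle: -1 is parked      */
        fold_idle_into_global_sketch();      /* next sample round folds it in     */
        return 0;
}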

A finer-grained timing problem

With the three patches in place everything looked perfect: idle was handled properly and the read/write separation of the global load was preserved. The test results from production, however, were unexpected: the load count improved noticeably with the patches, but it was still too low. Below is some captured trace data (in the original the lines where a CPU switched to the idle task were highlighted):

<...>-9195 [000] 11994.232382: calc_global_load: calc_load_task = 0
<...>-9198 [000] 11999.213365: calc_load_account_active: cpu 0 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 1
<...>-9199 [001] 11999.213379: calc_load_account_active: cpu 1 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 2
<...>-9194 [002] 11999.213394: calc_load_account_active: cpu 2 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 3
<...>-9198 [000] 11999.213406: calc_load_account_active: cpu 0 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 2
<...>-9201 [003] 11999.213409: calc_load_account_active: cpu 3 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 3
<...>-9190 [004] 11999.213424: calc_load_account_active: cpu 4 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 4
<...>-9197 [005] 11999.213440: calc_load_account_active: cpu 5 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 5
<...>-9194 [002] 11999.213448: calc_load_account_active: cpu 2 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 4
<...>-9203 [006] 11999.213455: calc_load_account_active: cpu 6 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 5
<...>-9202 [007] 11999.213471: calc_load_account_active: cpu 7 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 6
<...>-9195 [008] 11999.213487: calc_load_account_active: cpu 8 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 7
<...>-9204 [009] 11999.213502: calc_load_account_active: cpu 9 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 8
<...>-9190 [004] 11999.213517: calc_load_account_active: cpu 4 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 7
<...>-9192 [010] 11999.213519: calc_load_account_active: cpu 10 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 8
<...>-9200 [011] 11999.213533: calc_load_account_active: cpu 11 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 9
<...>-9189 [012] 11999.213548: calc_load_account_active: cpu 12 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 10
<...>-9196 [013] 11999.213564: calc_load_account_active: cpu 13 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 11
<...>-9193 [014] 11999.213580: calc_load_account_active: cpu 14 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 12
<...>-9191 [015] 11999.213596: calc_load_account_active: cpu 15 nr_run 1 nr_uni 0 nr_act 1 delta 1 calc 13
<...>-9204 [009] 11999.213610: calc_load_account_active: cpu 9 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 12
<...>-9195 [008] 11999.213645: calc_load_account_active: cpu 8 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 11
<...>-9203 [006] 11999.213782: calc_load_account_active: cpu 6 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 10
<...>-9197 [005] 11999.213809: calc_load_account_active: cpu 5 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 9
<...>-9196 [013] 11999.213930: calc_load_account_active: cpu 13 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 8
<...>-9193 [014] 11999.213971: calc_load_account_active: cpu 14 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 7
<...>-9189 [012] 11999.214004: calc_load_account_active: cpu 12 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 6
<...>-9199 [001] 11999.214032: calc_load_account_active: cpu 1 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 5
<...>-9191 [015] 11999.214164: calc_load_account_active: cpu 15 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 4
<...>-9202 [007] 11999.214201: calc_load_account_active: cpu 7 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 3
<...>-9201 [003] 11999.214353: calc_load_account_active: cpu 3 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 2
<...>-9192 [010] 11999.214998: calc_load_account_active: cpu 10 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 1
<...>-9200 [011] 11999.215115: calc_load_account_active: cpu 11 nr_run 0 nr_uni 0 nr_act 0 delta -1 calc 0
<...>-9198 [000] 11999.223342: calc_global_load: calc_load_task = 0

Although this trace was captured before the three patches were applied, it still exposes the problem. The 10 ticks that used to look like a tiny time slice have become a large one for this workload, while the slice one order of magnitude smaller, a single tick, never got the same attention. In the trace, cpu0, cpu2 and cpu4 enter idle after computing their own load but before the other CPUs have computed theirs; since by default every entry into idle folds that CPU's load into the global load, the load these CPUs contributed is subtracted again before the sampling round is over. It is the same unfairness as the old 10-tick problem, only at a finer granularity. An example:

loadavg computation on the 2.6.32 kernel:

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load_tasks   tasks_idle
0HZ+11             1     1     1     0     0     0     0     0         3                 0
5HZ                0     0     0     1     1     1     0     0
                  -1    -1    -1                                       3                -3
5HZ+1.3            1     0     0     0     0     0     1     1
                  +1                                                   3-3+1=1           0
5HZ+1.5            0     1     1     1     0     0     0     0         1                 0
                  -1          +1                                       1+1-1=1           0
5HZ+1.7            0     0     0     0     1     1     1     0         0                 0
                              -1          +1                           1-1+1=1           0
5HZ+3              0     1     1     1     0     0     1     0
                                                      -1               1                -1
5HZ+5              0     0     0     0     1     1     1     0
5HZ+11             1     1     0     0     0     0     1    -1
calc_global_load  <----------------------------------------------      1                -1

In this production workload the average running time of a task is 0.3ms and its period is 0.5ms, so each period contains about 0.2ms of idle. Within a single tick, a CPU that has finished its own load calculation is therefore very likely to enter idle before the other CPUs have run theirs, which makes the load calculation unfair between idle and non-idle and the load count inaccurate. One simple fix would be to detect the first CPU that starts the load calculation, fold the previously accumulated idle into the global load only on that CPU, and no longer fold later idle entries into the global load; the difficulty is identifying that first CPU. Another solution takes the LOAD_FREQ period point and the LOAD_FREQ+10 point (where the global load is folded into avenrun) as dividing lines: idle load accumulated between the previous LOAD_FREQ+10 and the current period point may be folded into the global load when each CPU runs its calculation in this period, while idle load accumulated between the period point and LOAD_FREQ+10 may only be folded into the global load after the global load has been folded into avenrun.
Peter Zijlstra adopted the second solution, implemented by flipping an index (idx). Using the two time points LOAD_FREQ and LOAD_FREQ+10, the load contributed by idle entries is split into two parts. The part between LOAD_FREQ and LOAD_FREQ+10 falls between the per-CPU load calculations and the update of avenrun, so it must not be folded into the global load directly; the part between LOAD_FREQ+10 and the next period point can be folded into the global load at any time. The implementation uses a 2-element array to hold these two parts, but the parts are not pinned to fixed elements: each LOAD_FREQ period writes into one element of the array. As laid out below, during the 0~5 period both parts of the idle contribution are stored in the element with index 1, and during the 5~10 period both are stored in the element with index 0. During the 5~10 period each CPU's load calculation reads the idle contribution stored during the 0~5 period; after avenrun has been updated, the idle folded into the global load is the idle load of the first 10 ticks of the 5~10 period, so in the 10~15 period each CPU's calculation reads the idle load accumulated after that avenrun update. The layout is as follows:

        0             5             10            15      ---> HZ
          +10           +10           +10           +10    ---> ticks
        |-|-----------|-|-----------|-|-----------|-|
  idx:  0 1           1 0           0 1           1 0
    w:  0 1 1         1 0 0         0 1 1         1 0 0
    r:  0 0 1         1 1 0         0 0 1         1 1 0
Notes: 1) 0, 5, 10, 15 stand for 0HZ, 5HZ, 10HZ, 15HZ, i.e. the period points at which the CPUs run their load calculation.
2) +10 marks the 10 ticks after each period point (the moment avenrun is computed).
3) idx is the current idx value (only the final value in each interval matters, so it alternates between 0 and 1).
4) w lists three values per period: the first is the array index written by idle deltas before the period point; the second is the index written by idle deltas
between the period point and +10; the third is the index written by idle deltas after avenrun has been computed, up to the next period point. r lists the indices read at the corresponding moments.
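
A schematic sketch of the double-buffered index (hypothetical names; the real kernel code keys the decision off jiffies rather than a flag): writes that happen inside the LOAD_FREQ..LOAD_FREQ+10 window go to the slot one ahead, so the pending avenrun update never sees them.

#include <stdatomic.h>

static atomic_long calc_load_idle[2];   /* parked idle deltas, double buffered */
static int calc_load_idx;               /* flipped once per LOAD_FREQ period   */

/* in_sample_window: between the LOAD_FREQ point and LOAD_FREQ+10 */
static int calc_load_write_idx_sketch(int in_sample_window)
{
        int idx = calc_load_idx;

        if (in_sample_window)
                idx++;                  /* write one slot ahead of the readers */
        return idx & 1;
}

static int calc_load_read_idx_sketch(void)
{
        return calc_load_idx & 1;       /* readers always use the current slot */
}

static void park_idle_delta_sketch(long delta, int in_sample_window)
{
        atomic_fetch_add(&calc_load_idle[calc_load_write_idx_sketch(in_sample_window)],
                         delta);
}

int main(void)
{
        park_idle_delta_sketch(-1, 1);  /* idle entry during the sample window   */
        /* the pending avenrun update reads calc_load_read_idx_sketch(), which
         * is the other slot, so it does not see the delta parked above */
        return 0;
}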

An example (assume idx is 0 after 0HZ+11):

loadavg computation on the 2.6.32 kernel (double-buffered idle deltas idle[0], idle[1] and idx):

                  cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7   calc_load_tasks   idle[0]  idle[1]  idx
0HZ+11             1     1     1     0     0     0     0     0         3                 0        0      0
5HZ                0     0     0     1     1     1     0     0
                  -1    -1    -1                                       3                -3        0      0
5HZ+1.3            1     0     0     0     0     0     1     1
                  +1                                                   3-3+1=1           0        0      0
5HZ+1.5            0     1     1     1     0     0     0     0         1                 0
                  -1          +1                                       1+1=2             0       -1      0
5HZ+1.7            0     0     0     0     1     1     1     0         0                 0
                              -1          +1                           2+1=3             0       -2      0
5HZ+3              0     1     1     1     0     0     1     0         0
                                                                       3                 0       -2      0
5HZ+5              0     0     0     0     1     1     1     0         0
5HZ+11             1     1     0     0     0     0     1     1
calc_global_load  <----------------------------------------------      3                 0       -2      0
                                                                       3-2=1             0        0      1
5HZ+15             1     1     0     0     0     0     0     1
                                                      -1               1                 0       -1      1

Returning to fairness

After handling idle scheduling at this finer granularity, the overall load of the production workload improved greatly: where the average number of running processes was 16 and the load used to hover around 1, it now came back to around 15.

But when this patch was posted to the community, the people who had originally reported the load accounting problems tested it and found the system load now came out too high, most of the time close to the total number of running processes. To check the effect of the patch we upgraded one machine with it and observed that its load was still about 1 higher than the original value of about 18.

That triggered another round of hard thinking: is there a bug in the current patch? Should the idle accumulated between the first CPU's calculation and the last CPU's calculation be folded straight into the global load after all? For a workload that schedules idle at high frequency, that idle contribution must not be added to the global load, otherwise no matter how many processes the system runs the final load will always hover around 0, so it clearly cannot be folded in directly. Analysis of the trace data also showed that the patch behaved exactly as designed; nothing abnormal was found.

If we assume all the previous patches are correct, what else could make the load come out too high? One of the most likely causes is that a moment which is actually idle gets accounted as non-idle. Along these lines we formed hypotheses about load balancing and about a process being woken onto the run queue exactly when the load is calculated, and eventually ruled them out one by one.

Looking at the trace data again, we noticed that almost every time a CPU finished its load calculation it entered idle immediately afterwards. All 16 CPUs ran their load calculation while non-idle, and every one of them went idle right after, and this happened on every load calculation, not occasionally. By the sampling logic the sampling time is not influenced by what the system is doing, so for a workload that bounces in and out of idle, the samples should land in idle and non-idle periods alike. Since only non-idle was ever observed, something had to be wrong with how the sampling moment is chosen.

Going further: if the sampling point falls inside an idle period, then because nohz stops the periodic sched_tick once the CPU is idle, the load calculation cannot run there, which at first sight would lose that CPU's contribution. That is not actually what happens: the load is computed before entering idle precisely to avoid losing the update under nohz, so when the CPU goes idle its current load has already been folded into the idle load, and the other CPUs pick it up when they run their own calculations.

From that logic, though, another conclusion follows: if the sampling point falls inside an idle period, the sample for that CPU should be the load it had when it entered idle. Is that what happens? Not quite: when the CPU later leaves nohz and re-enters the scheduler, its sampling deadline is still pending and more than one LOAD_FREQ period has passed since the last sample, so the load calculation runs again. This recalculation overwrites the value recorded when the CPU entered idle; the direct consequence is that a sampling point lying inside idle is accounted as non-idle! The problem is now clear: when the sampling point falls inside idle, the load that should be accounted is the load the CPU had when it entered idle, but because the per-CPU sampling deadline is not advanced, the calculation runs again after leaving nohz, and what ends up sampled is the load after leaving nohz.

Once the problem is clear the fix is simple: when leaving nohz, check whether the sampling deadline lies before the current time; if it does, the sampling point fell inside the idle period and there is no need to recompute the load on this CPU.
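
Schematically (hypothetical names; the real patch works on rq->calc_load_update and jiffies), the fix looks like this:

#include <stdio.h>

/* Called when a CPU leaves nohz idle.  this_update is the CPU's next
 * sampling deadline, expressed in jiffies. */
static void calc_load_exit_idle_sketch(unsigned long *this_update,
                                       unsigned long now, unsigned long load_freq)
{
        /* Deadline still in the future: nothing was missed while idle. */
        if (now < *this_update)
                return;

        /* The deadline passed while we were idle: that sample was already
         * accounted when we entered idle, so skip it and arm the next one
         * instead of recomputing the load now. */
        *this_update += load_freq;
}

int main(void)
{
        unsigned long next_sample = 5000;              /* assumed deadline (jiffies) */

        calc_load_exit_idle_sketch(&next_sample, 5003, 5000);
        printf("next sample at %lu\n", next_sample);   /* 10000: this round skipped */
        return 0;
}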
 
 
