Case study: an online failure caused by a Dubbo 2.7.12 bug

Bug Catching Master · 2021-10-14 06:58:27


Late one night recently, just after taking a shower, I got a call from a business team: their Dubbo service had failed, and they wanted my help looking into it.

On the phone, I asked them a few questions:

  • Is it a loss-causing production failure? — Yes
  • Has the loss been stopped? — Yes
  • Was the scene preserved? — No

So I opened my laptop, connected to the VPN, and started looking at the problem. For ease of understanding, the architecture is simplified as follows:


Only three services matter here: A, B, and C. The calls between them are all Dubbo calls.

During the failure, several machines of service B were completely wedged and could not handle requests. Traffic to the remaining healthy machines surged and latency climbed, as shown below (figure 1: request volume; figure 2: latency).


Troubleshooting the problem

Since the scene had already been destroyed, all I could do at first was look at monitoring and logs.

  • Monitoring

Besides the monitoring above, I looked at service B's CPU and memory. Memory on the faulty machines had grown considerably, all reaching the 80% line, and CPU consumption had risen as well.


At this point I suspected a memory problem, so I checked the JVM's full GC monitoring.


Sure enough, full GC time had risen sharply. It was basically safe to conclude that a memory leak had made the service unavailable. But why memory was leaking, there was no clue yet.

  • Logs

After getting access to the machines and checking the logs, I found a very strange WARN entry:

[dubbo-future-timeout-thread-1] WARN org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout
-  [DUBBO] An exception was thrown by TimerTask., dubbo version: 2.7.12, current host:
rejected from java.util.concurrent.ThreadPoolExecutor@7a9f0e84[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 21]

You can see that the business team was using Dubbo version 2.7.12.

Searching Dubbo's GitHub repository with this log turned up the following issue:


But that issue was quickly ruled out, because its fix is already present in version 2.7.12.

Continuing to dig, I found these two issues:

Judging from the error message and the version, they matched perfectly, but neither mentioned memory problems. Setting the memory question aside for the moment, I tried to reproduce the problem following issue #8188.


The issue explains clearly how to reproduce the problem, so I set up these three services and tried. At first it would not reproduce, so I worked backwards from the fix.


The fix deletes the problematic piece of code, but that code path is hard to hit. How do we get into it?

Here, a future represents a request, and the path is only entered while the request has not yet completed. That makes it easy: make the provider never return, and the path is guaranteed to be hit. So I added blocking test code on the provider side.
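The blocking provider can be sketched without any Dubbo dependency. This is an illustrative, self-contained example (the method name and setup are made up, not the actual test code): a "provider" method that never returns, so the consumer-side future for the request stays pending forever.

```java
import java.util.concurrent.*;

// Illustrative only: a "provider" method that blocks forever, so the
// request submitted to it never completes on its own.
public class BlockingProvider {
    static String sayHello(String name) throws InterruptedException {
        new CountDownLatch(1).await(); // blocks forever: the request never returns
        return "unreachable";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<String> call = worker.submit(() -> sayHello("world"));
        try {
            call.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("call still pending"); // the call never returns by itself
        }
        worker.shutdownNow(); // interrupt the blocked task so the demo can exit
    }
}
```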


After that change it reproduced. As the issue describes, when the first provider is killed with kill -9, the consumer's global ExecutorService is shut down; when the second provider is killed with kill -9, SHARED_EXECUTOR is shut down as well.

So what is this thread pool used for?

It is used in HashedWheelTimer to detect whether consumer requests have timed out.

HashedWheelTimer is Dubbo's time-wheel implementation for detecting request timeouts. I won't expand on the details here; the time-wheel algorithm in Dubbo deserves a detailed article of its own some day.

When a request is sent, all is well if it returns normally. But if it still has not returned after the configured timeout, a task in this thread pool is needed to detect that and fail the timed-out request.
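The role of that timeout task can be sketched without the time wheel itself. Below is a minimal illustration, assuming a plain ScheduledExecutorService in place of HashedWheelTimer; the principle is the same: a delayed task fails the pending future if no response has arrived by the deadline.

```java
import java.util.concurrent.*;

// Minimal sketch of timeout detection. A ScheduledExecutorService stands in
// for Dubbo's HashedWheelTimer: when a request is sent, a delayed task is
// scheduled that completes the pending future exceptionally if the response
// has not arrived in time.
public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<String> pending = new CompletableFuture<>();

        // scheduled at request time, alongside sending the request
        timer.schedule(
                () -> pending.completeExceptionally(new TimeoutException("no response in 100ms")),
                100, TimeUnit.MILLISECONDS);

        // the provider never replies, so the timeout task is what completes the future
        try {
            pending.get();
        } catch (ExecutionException e) {
            System.out.println(e.getCause().getClass().getSimpleName());
        }
        timer.shutdown();
    }
}
```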

The following code is where that task gets run. Once the thread pool has been shut down, submitting the task throws an exception (the WARN above), and the timeout can no longer be detected:

public void expire() {
    if (!compareAndSetState(ST_INIT, ST_EXPIRED)) {
        return;
    }

    try {
        task.run(this);
    } catch (Throwable t) {
        if (logger.isWarnEnabled()) {
            logger.warn("An exception was thrown by " + TimerTask.class.getSimpleName() + '.', t);
        }
    }
}
At this point it suddenly dawned on me: if requests keep being sent and never time out, could memory be exhausted? So I simulated it again, starting three threads that kept sending requests to the provider. Sure enough, the memory-exhaustion scene reproduced; and when the bug was not triggered, memory stayed stable at a low level.
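Why do missed timeouts exhaust memory? A pending request typically holds an entry in a map until either the response arrives or the timeout task cleans it up. If the timeout task can never be submitted and the provider never replies, the entries accumulate without bound. A simplified sketch (class and field names are illustrative, not Dubbo's actual bookkeeping code):

```java
import java.util.Map;
import java.util.concurrent.*;

// Illustrative sketch of the leak (names are made up, not Dubbo's real code):
// every in-flight request is tracked in a map, and its entry is removed either
// by the response handler or by the timeout task. With the timeout executor
// terminated and the provider blocked, neither removal ever happens.
public class LeakSketch {
    static final Map<Long, CompletableFuture<String>> PENDING = new ConcurrentHashMap<>();

    static CompletableFuture<String> send(long requestId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        PENDING.put(requestId, future);
        // normally: timer.schedule(() -> PENDING.remove(requestId), timeout, ...);
        // here the timer is dead, so no cleanup is ever scheduled
        return future;
    }

    public static void main(String[] args) {
        for (long i = 0; i < 10_000; i++) {
            send(i); // the provider never replies
        }
        System.out.println("pending entries: " + PENDING.size());
    }
}
```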


I used Arthas to watch the memory changes here; it's very convenient.
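For reference, this is the kind of Arthas session used for such checks (`dashboard` and `heapdump` are real Arthas commands; the pid and path are illustrative). It's a command fragment, not a runnable script:

```shell
# Attach Arthas to the target JVM (pick the pid interactively)
java -jar arthas-boot.jar <pid>

# Inside the Arthas console:
dashboard                        # live view of threads, memory pools and GC counts
heapdump /tmp/service-b.hprof    # dump the heap for offline analysis (e.g. with MAT)
```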

Reaching a conclusion

After reproducing it locally, I confirmed with the business team. The conditions for triggering this problem are actually quite strict: first, calls must be asynchronous; second, a provider must go offline abnormally; finally, a provider must block, i.e. some request must never return.

The business team confirmed the asynchronous calls. A provider going offline abnormally is fairly common; it happens when a container drifts because its physical machine fails. As for the blocked provider, the business team confirmed that too: a machine of service C did freeze near that point in time, unable to handle requests although the process was still alive.

So the problem was indeed caused by this Dubbo 2.7.12 bug. The bug was introduced in 2.7.10 and fixed in 2.7.13.


It took about a day to locate and reproduce. That went fairly smoothly, with good luck and no detours, but there are still some points worth noting.

  • While stopping the loss, it's best to preserve the scene. Dumping memory before restarting, or draining traffic from one machine to keep it as evidence, can help speed up locating the problem, as can means such as configuring the JVM to dump memory automatically on OOM. This was the shortcoming in this incident.
  • Service observability is very important, whether logs, monitoring, or anything else; it should all be complete. The basics include logs, inbound and outbound request monitoring, machine metrics (memory, CPU, network, etc.), and JVM monitoring (thread pools, GC, etc.). This part was fine here; basically everything that should exist did.
  • For open-source products, you can search the web using key log lines; there is a good chance the problem you hit has been hit by others too. That was another lucky break this time, saving a lot of detours.
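As a concrete example of the first point, the JVM can be told to dump the heap automatically on OOM. These are standard HotSpot flags (the paths and jar name below are just examples):

```shell
# Standard HotSpot flags: write a heap dump automatically when an
# OutOfMemoryError occurs (the dump path is illustrative).
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/data/dumps \
     -jar your-app.jar

# A heap dump can also be taken manually before restarting a sick instance:
# jmap -dump:live,format=b,file=/data/dumps/heap.hprof <pid>
```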

WeChat official account "Bug Catching Master": back-end technology sharing on architecture design, performance optimization, source-code reading, troubleshooting, and lessons from pitfalls.

- END -
Please include a link to the original when reprinting. Thanks.