应用与系统稳定性第六篇---JVM垃圾回收之finalize执行时引起timed out 闪退分析

一、背景

java.util.concurrent.TimeoutException: android.content.res.AssetManager$AssetInputStream.finalize() timed out after 10 seconds
 at android.content.res.AssetManager$AssetInputStream.close(AssetManager.java:812)
 at android.content.res.AssetManager$AssetInputStream.finalize(AssetManager.java:845)
 at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:202)
 at java.lang.Daemons$FinalizerDaemon.run(Daemons.java:185)
 at java.lang.Thread.run(Thread.java:833)

这是一个高频问题,必须要诊治一下了,去查看一下这个crash在我们发出去的最新版本上的上报情况,看到所有重写finalize方法的对象,都有可能发生这个异常,堆栈有几十种,其中AssetManager是个大户,发生了289次,ThreadedRenderer发生了33次。其余的类大部分都是我们自己的,比如CursorWindow,SQliteCursor等

二、初步分析

android.content.res.AssetManager$AssetInputStream.finalize() timed out after 10 seconds,从报错的直观意思上来看,是由于finalize方法超时了,下面来模拟一下。

class MyObject extends View {
       public MyObject(Context context) {
           super(context);
       }
       @Override
       protected void finalize() throws Throwable {
           System.out.println("WANG  finalize begain");
           try{
               Thread.sleep(2000);
           }finally {
               super.finalize();
               System.out.println("WANG  finalize end");
           }
       }
   }

view.setOnClickListener(new View.OnClickListener() {
           @Override
           public void onClick(View v) {
               for (int i = 0; i < 1000; i++) {
                  MyObject object= new MyObject(MainActivity.this);
               }
           }
       });

当每次被GC的时候,MyObject可以被正常的回收,如下图。


image.png

看一下FinalizerWatchdogDaemon线程的trace,Sleeping状态,可能抓的时候,正处于sleep。

"FinalizerWatchdogDaemon" daemon prio=5 tid=5 Sleeping
 | group="system" sCount=1 dsCount=0 flags=1 obj=0x12f40350 self=0x6fad268c00
 | sysTid=1347 nice=4 cgrp=default sched=0/0 handle=0x6f95d264f0
 | state=S schedstat=( 118415628 218405939 1140 ) utm=7 stm=4 core=6 HZ=100
 | stack=0x6f95c23000-0x6f95c25000 stackSize=1041KB
 | held mutexes=
 at java.lang.Thread.sleep(Native method)
 - sleeping on <0x0edad885> (a java.lang.Object)
 at java.lang.Thread.sleep(Thread.java:373)
 - locked <0x0edad885> (a java.lang.Object)
 at java.lang.Thread.sleep(Thread.java:314)
 at java.lang.Daemons$FinalizerWatchdogDaemon.sleepFor(Daemons.java:344)
 at java.lang.Daemons$FinalizerWatchdogDaemon.waitForFinalization(Daemons.java:366)
 at java.lang.Daemons$FinalizerWatchdogDaemon.runInternal(Daemons.java:283)
 at java.lang.Daemons$Daemon.run(Daemons.java:105)
 at java.lang.Thread.run(Thread.java:764)

现在修改一下程序,把睡眠2s改成睡眠20s,前台运行一段时间之后,突然崩溃了(和机型相关,有一些新出来的机型,系统已经优化,这个问题就不会再出现),FinalizerWatchdogDaemon线程报错了。

04-23 15:27:57.153 26407-26416/com.example.myapplication E/AndroidRuntime: FATAL EXCEPTION: FinalizerWatchdogDaemon
   Process: com.example.myapplication, PID: 26407
   java.util.concurrent.TimeoutException: com.example.myapplication.MainActivity$MyObject.finalize() timed out after 10 seconds
       at java.lang.Thread.sleep(Native Method)
       at java.lang.Thread.sleep(Thread.java:1031)
       at java.lang.Thread.sleep(Thread.java:985)
       at com.example.myapplication.MainActivity$MyObject.finalize(MainActivity.java:36)
       at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:202)
       at java.lang.Daemons$FinalizerDaemon.run(Daemons.java:185)
       at java.lang.Thread.run(Thread.java:818)

所以我们知道,这个crash和finalize的耗时有一定关系,那我们在看看我们的代码是怎么写的?

256      /**
257       * Release the native resources, if they haven't been released yet.
258       */
259      @Override
260      protected void finalize() {
261          try {
262              // if the cursor hasn't been closed yet, close it first
263              if (mWindow != null) {
264                  close();
265              }
266          } finally {
267              super.finalize();
268          }
269      }
138      @Override
139      protected void finalize() throws Throwable {
140          try {
141              dispose();
142          } finally {
143              super.finalize();
144          }
145      }
146  
147      private void dispose() {
148          if (mWindowPtr != 0) {
149              recordClosingOfWindow(mWindowPtr);
150              nativeDispose(mWindowPtr);
151              mWindowPtr = 0;
152          }
153      }
154  

由此可见,finalize中都做一些额外的工作,正是这些额外的任务,在系统资源紧张,GC的时候很繁忙导致Finalizer对象回收超时,触发了App的crash。

三、深入分析

FinalizerWatchdogDaemon线程顾名思义,带有一个watchdog,说明和一个看门狗的性质是一样的,超过一定的时候不喂狗,就会被狗咬,看看源码是不是这个样子的。

http://androidxref.com/9.0.0_r3/xref/libcore/libart/src/main/java/java/lang/Daemons.java#37

30/**
31 * Calls Object.finalize() on objects in the finalizer reference queue. The VM
32 * will abort if any finalize() call takes more than the maximum finalize time
33 * to complete.
34 *
35 * @hide
36 */
37public final class Daemons {
38    private static final int NANOS_PER_MILLI = 1000 * 1000;
39    private static final int NANOS_PER_SECOND = NANOS_PER_MILLI * 1000;
40    private static final long MAX_FINALIZE_NANOS = 10L * NANOS_PER_SECOND;
41
       //开启四个线程
42    public static void start() {
43        ReferenceQueueDaemon.INSTANCE.start();
44        FinalizerDaemon.INSTANCE.start();
45        FinalizerWatchdogDaemon.INSTANCE.start();
46        HeapTaskDaemon.INSTANCE.start();
47    }
48
55    //停止4个线程
56    public static void stop() {
57        HeapTaskDaemon.INSTANCE.stop();
58        ReferenceQueueDaemon.INSTANCE.stop();
59        FinalizerDaemon.INSTANCE.stop();
60        FinalizerWatchdogDaemon.INSTANCE.stop();
61    }
   ......
62}

Daemons类的开头注释很清楚,从finalizer reference queue(终结引用队列)中取出对象,调用对象的finalize()方法的时候.如果超出了最大终结时间(一般为MAX_FINALIZE_NANOS,值是10秒),JVM就会中止。这个类中开启了4个线程。

FinalizerDaemon:对于重写了成员函数finalize的对象,它们被GC决定回收时,并没有马上被回收,而是被放入到一个队列中,等待FinalizerDaemon守护线程去调用它们的成员函数finalize,然后再被回收。

FinalizerWatchdogDaemon:与上面的线程对应,用来监控FinalizerDaemon线程的执行。一旦检测那些重写了finalize的对象在执行成员函数finalize时超出一定时间,那么就会退出JVM。

我们来看看FinalizerDaemon线程的实现。

193    private static class FinalizerDaemon extends Daemon {
194        private static final FinalizerDaemon INSTANCE = new FinalizerDaemon();
195        private final ReferenceQueue<Object> queue = FinalizerReference.queue;
196        private final AtomicInteger progressCounter = new AtomicInteger(0);
197        // Object (not reference!) being finalized. Accesses may race!
198        private Object finalizingObject = null;
199
200        FinalizerDaemon() {
201            super("FinalizerDaemon");
202        }
203
204        @Override public void runInternal() {
205            // This loop may be performance critical, since we need to keep up with mutator
206            // generation of finalizable objects.
207            // We minimize the amount of work we do per finalizable object. For example, we avoid
208            // reading the current time here, since that involves a kernel call per object.  We
209            // limit fast path communication with FinalizerWatchDogDaemon to what's unavoidable: A
210            // non-volatile store to communicate the current finalizable object, e.g. for
211            // reporting, and a release store (lazySet) to a counter.
212            // We do stop the  FinalizerWatchDogDaemon if we have nothing to do for a
213            // potentially extended period.  This prevents the device from waking up regularly
214            // during idle times.
215
216            // Local copy of progressCounter; saves a fence per increment on ARM and MIPS.
217            int localProgressCounter = progressCounter.get();
218
219            while (isRunning()) {
220                try {
221                    // Use non-blocking poll to avoid FinalizerWatchdogDaemon communication
222                    // when busy.
223                    FinalizerReference<?> finalizingReference = (FinalizerReference<?>)queue.poll();
224                    if (finalizingReference != null) {
225                        finalizingObject = finalizingReference.get();
226                        progressCounter.lazySet(++localProgressCounter);
227                    } else {
228                        finalizingObject = null;
229                        progressCounter.lazySet(++localProgressCounter);
230                        // Slow path; block.
231                        FinalizerWatchdogDaemon.INSTANCE.goToSleep();
232                        finalizingReference = (FinalizerReference<?>)queue.remove();
233                        finalizingObject = finalizingReference.get();
234                        progressCounter.set(++localProgressCounter);
235                        FinalizerWatchdogDaemon.INSTANCE.wakeUp();
236                    }
237                    doFinalize(finalizingReference);
238                } catch (InterruptedException ignored) {
239                } catch (OutOfMemoryError ignored) {
240                }
241            }
242        }
243
244        @FindBugsSuppressWarnings("FI_EXPLICIT_INVOCATION")
245        private void doFinalize(FinalizerReference<?> reference) {
246            FinalizerReference.remove(reference);
247            Object object = reference.get();
248            reference.clear();
249            try {
                   //上面从reference取出对象,调用它的finalize
250                object.finalize();
251            } catch (Throwable ex) {
252                // The RI silently swallows these, but Android has always logged.
253                System.logE("Uncaught exception thrown by finalizer", ex);
254            } finally {
255                // Done finalizing, stop holding the object as live.
256                finalizingObject = null;
257            }
258        }

在看看FinalizerWatchdogDaemon线程的实现

261    /**
262     * The watchdog exits the VM if the finalizer ever gets stuck. We consider
263     * the finalizer to be stuck if it spends more than MAX_FINALIZATION_MILLIS
264     * on one instance.
265     */
266    private static class FinalizerWatchdogDaemon extends Daemon {
267        private static final FinalizerWatchdogDaemon INSTANCE = new FinalizerWatchdogDaemon();
268
269        private boolean needToWork = true;  // Only accessed in synchronized methods.
270
271        FinalizerWatchdogDaemon() {
272            super("FinalizerWatchdogDaemon");
273        }
274
275        @Override public void runInternal() {
276            while (isRunning()) {
277                if (!sleepUntilNeeded()) {
278                    // We have been interrupted, need to see if this daemon has been stopped.
279                    continue;
280                }
281                final Object finalizing = waitForFinalization();
282                if (finalizing != null && !VMRuntime.getRuntime().isDebuggerActive()) {
283                    finalizerTimedOut(finalizing);
284                    break;
285                }
286            }
287        }

waitForFinalization方法用来监控FinalizerDaemon线程的回收,如果超时waitForFinalization方法返回值就不为空,触发了finalizerTimedOut方法的执行,进程被杀死,JVM退出。

396        private static void finalizerTimedOut(Object object) {
397            // The current object has exceeded the finalization deadline; abort!
398            String message = object.getClass().getName() + ".finalize() timed out after "
399                    + (MAX_FINALIZE_NANOS / NANOS_PER_SECOND) + " seconds";
400            Exception syntheticException = new TimeoutException(message);
401            // We use the stack from where finalize() was running to show where it was stuck.
402            syntheticException.setStackTrace(FinalizerDaemon.INSTANCE.getStackTrace());
403
404            // Send SIGQUIT to get native stack traces.
405            try {
                  //杀死进程
406                Os.kill(Os.getpid(), OsConstants.SIGQUIT);
407                // Sleep a few seconds to let the stack traces print.
408                Thread.sleep(5000);
409            } catch (Exception e) {
410                System.logE("failed to send SIGQUIT", e);
411            } catch (OutOfMemoryError ignored) {
412                // May occur while trying to allocate the exception.
413            }
426            if (Thread.getUncaughtExceptionPreHandler() == null &&
427                    Thread.getDefaultUncaughtExceptionHandler() == null) {
428                // If we have no handler, log and exit.
429                System.logE(message, syntheticException);
430                System.exit(2);
431            }
436            Thread.currentThread().dispatchUncaughtException(syntheticException);
437        }
438    }

从Android 5.0开始,每个View都包含了一个或者多个的Finalizer对象,如果页面的View比较多,甚至还有Activity等对象泄漏的话,那么FinalizerDaemon线程需要检测的对象越来越多,负担可想而知,在低端设备上,有可能就会来不及回收而引起性能和稳定性问题,给了FinalizerWatchdogDaemon线程可乘之机.

四、解决方案

从第三部分我们知道这个问题的RootCasue是对象回收超时,那么修复的方法基本上有两种,一种是破坏FinalizerWatchdogDaemon线程,使之stop。

  public void fix1() {
        try {
            Class clazz = Class.forName("java.lang.Daemons$FinalizerWatchdogDaemon");

            Method method = clazz.getSuperclass().getDeclaredMethod("stop");
            method.setAccessible(true);

            Field field = clazz.getDeclaredField("INSTANCE");
            field.setAccessible(true);

            method.invoke(field.get(null));

        } catch (Throwable e) {
            e.printStackTrace();
        }
    }

这种方案从DEMO中确实可以解决timeout的异常问题,但是风险未知。

另外想法一种是加长时间到60s,如下

 private static final int NANOS_PER_MILLI = 1000 * 1000;
    private static final int NANOS_PER_SECOND = NANOS_PER_MILLI * 1000;
    private static final long MAX_FINALIZE_NANOS = 10L * NANOS_PER_SECOND;

    public void fix2() {
        try {
            Class<?> c = Class.forName("java.lang.Daemons");
            Field maxField = c.getDeclaredField("MAX_FINALIZE_NANOS");
            maxField.setAccessible(true);
            maxField.set(null, MAX_FINALIZE_NANOS * 6);

        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
    }

第二种方案我在做ROM的时候,改过Daemons.java的源码,直接改成了60秒,全机型,未发现风险,这可能也是为什么出来的错误有的是timed out after 30 seconds的原因,但是由于MAX_FINALIZE_NANOS是final的string常量,并不能用反射,这种思路在App中行不通。

综上,我修复方案是,对于已经国内ROM已经把MAX_FINALIZE_NANOS改变大于30s的的情况,暂时不动。对于仍然是10s的,stop掉FinalizerWatchdogDaemon线程,为了风险可控,加上了云控开关,可以随时控制关闭FinalizerWatchdogDaemon线程这个功能是否打开。

最后给出个人的几个建议

1、不到万不得已,不要重写finalize方法

2、做好内存优化,减少GC的频繁调用