CTS问题分析10

CTS/GTS 问题分析10

一个CTS框架自身引起的问题

问题初探

测试命令: run retry --retry 1 --shard-count 6 -s 45f971f -s 5b01edc -s 9485acea -s 9485acda -s 948bac82 -s 948bac62

报错堆栈:

10-31 11:30:00 E/CommandScheduler: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap.newNode(HashMap.java:1747)
at java.util.HashMap.putVal(HashMap.java:631)
at java.util.HashMap.put(HashMap.java:612)
at java.util.HashSet.add(HashSet.java:220)
at java.util.AbstractCollection.addAll(AbstractCollection.java:344)
at com.android.tradefed.config.OptionSetter.setFieldValue(OptionSetter.java:452)
at com.android.tradefed.config.OptionSetter.setFieldValue(OptionSetter.java:549)
at com.android.tradefed.config.OptionCopier.copyOptions(OptionCopier.java:49)
at com.android.tradefed.config.OptionCopier.copyOptionsNoThrow(OptionCopier.java:60)
at com.android.tradefed.testtype.suite.ITestSuite.split(ITestSuite.java:662)
at com.android.compatibility.common.tradefed.testtype.retry.RetryFactoryTest.split(RetryFactoryTest.java:184)
at com.android.tradefed.invoker.shard.ShardHelper.shardTest(ShardHelper.java:230)
at com.android.tradefed.invoker.shard.ShardHelper.shardConfig(ShardHelper.java:86)
at com.android.tradefed.invoker.shard.StrictShardHelper.shardConfig(StrictShardHelper.java:51)
at com.android.tradefed.invoker.InvocationExecution.shardConfig(InvocationExecution.java:149)
at com.android.tradefed.invoker.TestInvocation.invoke(TestInvocation.java:656)
at com.android.tradefed.command.CommandScheduler$InvocationThread.run(CommandScheduler.java:566)

首先,这是一个频繁full gc,但是并没有回收多少内存造成的报错,我们将gc日志也打出来

尝试将tools中的cts-tradefed里面改为

java $RDBG_FLAG -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -cp ${JAR_PATH} -DCTS_ROOT=${CTS_ROOT} com.android.compatibility.common.tradefed.command.CompatibilityConsole "$@"

重新复现,打出如下log:

run retry --retry 1 --shard-count 6 -s 45f971f -s 5b01edc -s 9485acea -s 9485acda -s 948bac82 -s 948bac62
cts-tf > [GC (Allocation Failure) [PSYoungGen: 69229K->8168K(74752K)] 69237K->8256K(245760K), 0.0052029 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 72680K->8632K(74752K)] 72768K->8728K(245760K), 0.0058188 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 73144K->10232K(139264K)] 73240K->11797K(310272K), 0.0099151 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 139256K->10216K(139264K)] 140821K->17685K(310272K), 0.0056568 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 139240K->17240K(272896K)] 146709K->24717K(443904K), 0.0097234 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
[GC (Allocation Failure) [PSYoungGen: 272216K->9303K(275968K)] 279693K->21077K(446976K), 0.0250198 secs] [Times: user=0.14 sys=0.01, real=0.02 secs]
[GC (Allocation Failure) [PSYoungGen: 264279K->18535K(527872K)] 276053K->33526K(698880K), 0.0108739 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 526439K->6775K(529920K)] 541430K->27020K(700928K), 0.0066831 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 514679K->21239K(832512K)] 534924K->41500K(1003520K), 0.0085360 secs] [Times: user=0.03 sys=0.02, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 831735K->22703K(835072K)] 851996K->55692K(1006080K), 0.0237944 secs] [Times: user=0.11 sys=0.02, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 833199K->2509K(1323520K)] 866188K->57568K(1494528K), 0.0079843 secs] [Times: user=0.04 sys=0.01, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 1298893K->2224K(1324032K)] 1353952K->59283K(1495040K), 0.0044128 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 1298608K->5632K(1369088K)] 1355667K->64823K(1540096K), 0.0037587 secs] [Times: user=0.01 sys=0.01, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 1348096K->8516K(1369600K)] 1407287K->73123K(1540608K), 0.0105542 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 1350980K->26096K(1204736K)] 1415587K->304733K(1483776K), 0.1134238 secs] [Times: user=0.70 sys=0.10, real=0.11 secs]
[Full GC (Ergonomics) [PSYoungGen: 26096K->7141K(1204736K)] [ParOldGen: 278637K->278802K(463360K)] 304733K->285943K(1668096K), [Metaspace: 19544K->19544K(1067008K)], 0.3347239 secs] [Times: user=2.19 sys=0.01, real=0.33 secs]
[GC (Allocation Failure) [PSYoungGen: 1185765K->109552K(1288192K)] 1464567K->506492K(1751552K), 0.1187568 secs] [Times: user=0.78 sys=0.03, real=0.12 secs]
[Full GC (Ergonomics) [PSYoungGen: 109552K->41822K(1288192K)] [ParOldGen: 396940K->463058K(751616K)] 506492K->504881K(2039808K), [Metaspace: 19544K->19544K(1067008K)], 0.7046475 secs] [Times: user=5.09 sys=0.02, real=0.71 secs]
[GC (Metadata GC Threshold) [PSYoungGen: 407656K->69930K(1171456K)] 870714K->575073K(1923072K), 0.0845791 secs] [Times: user=0.50 sys=0.02, real=0.08 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 69930K->0K(1171456K)] [ParOldGen: 505142K->573390K(946176K)] 575073K->573390K(2117632K), [Metaspace: 33176K->33176K(1079296K)], 0.9029356 secs] [Times: user=6.67 sys=0.04, real=0.90 secs]
[GC (Allocation Failure) [PSYoungGen: 1061888K->18008K(1229824K)] 1635278K->591406K(2176000K), 0.0101945 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
[GC (Metadata GC Threshold) [PSYoungGen: 847656K->24174K(1211392K)] 1421054K->597580K(2157568K), 0.0188445 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 24174K->0K(1211392K)] [ParOldGen: 573406K->586836K(1096192K)] 597580K->586836K(2307584K), [Metaspace: 55555K->55555K(1099776K)], 0.3028041 secs] [Times: user=2.13 sys=0.00, real=0.31 secs]
[GC (Allocation Failure) [PSYoungGen: 1067008K->156134K(1223168K)] 1653844K->1121214K(2319360K), 0.1780735 secs] [Times: user=1.21 sys=0.11, real=0.18 secs]
[Full GC (Ergonomics) [PSYoungGen: 156134K->19416K(1223168K)] [ParOldGen: 965079K->1096069K(1870336K)] 1121214K->1115486K(3093504K), [Metaspace: 70120K->70120K(1112064K)], 2.1907503 secs] [Times: user=11.86 sys=0.06, real=2.19 secs]
[GC (Allocation Failure) [PSYoungGen: 1086424K->174561K(640512K)] 2182494K->1840018K(2510848K), 0.3123986 secs] [Times: user=2.16 sys=0.18, real=0.31 secs]
[Full GC (Ergonomics) [PSYoungGen: 174561K->0K(640512K)] [ParOldGen: 1665456K->1832562K(2796544K)] 1840018K->1832562K(3437056K), [Metaspace: 70120K->70120K(1112064K)], 2.0454693 secs] [Times: user=8.06 sys=0.09, real=2.05 secs]
[GC (Allocation Failure) [PSYoungGen: 465920K->325538K(931840K)] 2298482K->2158101K(3728384K), 0.0888857 secs] [Times: user=0.66 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 791458K->318690K(931840K)] 2624021K->2477357K(3728384K), 0.1941567 secs] [Times: user=1.38 sys=0.07, real=0.20 secs]
[GC (Allocation Failure) [PSYoungGen: 784610K->324642K(931840K)] 2943277K->2802237K(3728384K), 0.1914071 secs] [Times: user=1.33 sys=0.13, real=0.19 secs]
[Full GC (Ergonomics) [PSYoungGen: 324642K->3579K(931840K)] [ParOldGen: 2477595K->2796122K(2796544K)] 2802237K->2799702K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 4.2754003 secs] [Times: user=21.15 sys=0.16, real=4.27 secs]
[Full GC (Ergonomics) [PSYoungGen: 469499K->317748K(931840K)] [ParOldGen: 2796122K->2796308K(2796544K)] 3265622K->3114057K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 2.1786764 secs] [Times: user=13.12 sys=0.02, real=2.18 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->421125K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3217434K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.8483104 secs] [Times: user=12.65 sys=0.02, real=1.85 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->452682K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3248990K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.8304428 secs] [Times: user=12.95 sys=0.03, real=1.84 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->453492K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3249801K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.8404874 secs] [Times: user=13.18 sys=0.01, real=1.84 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->459865K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3256173K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 5.8066516 secs] [Times: user=40.27 sys=0.05, real=5.81 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->460566K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3256874K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.8195868 secs] [Times: user=13.06 sys=0.00, real=1.82 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->460623K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3256931K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.6723615 secs] [Times: user=11.94 sys=0.02, real=1.67 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->464483K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3260791K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.8485084 secs] [Times: user=13.24 sys=0.00, real=1.85 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->464888K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3261196K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.8069882 secs] [Times: user=12.92 sys=0.00, real=1.81 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465369K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3261677K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.7872715 secs] [Times: user=12.79 sys=0.01, real=1.78 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465582K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3261890K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.5752262 secs] [Times: user=11.10 sys=0.01, real=1.57 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465733K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3262041K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.6507156 secs] [Times: user=11.71 sys=0.01, real=1.65 secs]
[Full GC (Ergonomics) [PSYoungGen: 465907K->465771K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262215K->3262080K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.5695637 secs] [Times: user=11.09 sys=0.01, real=1.56 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465822K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3262130K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.5955533 secs] [Times: user=11.32 sys=0.00, real=1.60 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465887K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3262195K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.5837765 secs] [Times: user=11.24 sys=0.00, real=1.58 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465896K(931840K)] [ParOldGen: 2796308K->2796308K(2796544K)] 3262228K->3262204K(3728384K), [Metaspace: 70120K->70120K(1112064K)], 1.7875755 secs] [Times: user=12.81 sys=0.02, real=1.79 secs]
[Full GC (Allocation Failure) [PSYoungGen: 465896K->454120K(931840K)] [ParOldGen: 2796308K->2796201K(2796544K)] 3262204K->3250322K(3728384K), [Metaspace: 70120K->70015K(1112064K)], 2.7476680 secs] [Times: user=16.59 sys=0.03, real=2.75 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->455662K(931840K)] [ParOldGen: 2796219K->2796179K(2796544K)] 3262139K->3251841K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 9.0061886 secs] [Times: user=64.63 sys=0.22, real=9.01 secs]
[Full GC (Ergonomics) [PSYoungGen: 465917K->457887K(931840K)] [ParOldGen: 2796179K->2796163K(2796544K)] 3262096K->3254050K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5904952 secs] [Times: user=11.19 sys=0.01, real=1.59 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->463660K(931840K)] [ParOldGen: 2796163K->2796163K(2796544K)] 3262083K->3259823K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6110997 secs] [Times: user=11.38 sys=0.01, real=1.61 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465113K(931840K)] [ParOldGen: 2796163K->2796163K(2796544K)] 3262083K->3261276K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6458507 secs] [Times: user=11.83 sys=0.01, real=1.64 secs]
[Full GC (Ergonomics) [PSYoungGen: 465806K->465719K(931840K)] [ParOldGen: 2796163K->2796163K(2796544K)] 3261969K->3261883K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6579255 secs] [Times: user=11.79 sys=0.01, real=1.66 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465720K(931840K)] [ParOldGen: 2796422K->2796420K(2796544K)] 3262342K->3262141K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 8.4967094 secs] [Times: user=60.98 sys=0.06, real=8.50 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465721K(931840K)] [ParOldGen: 2796420K->2796419K(2796544K)] 3262340K->3262140K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.8118774 secs] [Times: user=12.98 sys=0.00, real=1.81 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465593K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262013K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5768498 secs] [Times: user=11.15 sys=0.01, real=1.58 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465618K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262037K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.7890317 secs] [Times: user=12.83 sys=0.00, real=1.79 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465641K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262061K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5872775 secs] [Times: user=11.27 sys=0.01, real=1.58 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465642K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262061K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5884750 secs] [Times: user=11.28 sys=0.02, real=1.59 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465675K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262094K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5860467 secs] [Times: user=11.21 sys=0.01, real=1.59 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465712K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262131K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5936147 secs] [Times: user=11.33 sys=0.01, real=1.59 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465709K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262128K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6066747 secs] [Times: user=11.41 sys=0.01, real=1.60 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465887K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262307K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5937984 secs] [Times: user=11.28 sys=0.01, real=1.60 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465919K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262339K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6830851 secs] [Times: user=11.99 sys=0.01, real=1.68 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465919K(931840K)] [ParOldGen: 2796419K->2796419K(2796544K)] 3262339K->3262339K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5858877 secs] [Times: user=11.23 sys=0.01, real=1.58 secs]
[Full GC (Ergonomics) [PSYoungGen: 465919K->465919K(931840K)] [ParOldGen: 2796421K->2796421K(2796544K)] 3262341K->3262341K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6499274 secs] [Times: user=11.71 sys=0.01, real=1.66 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465919K(931840K)] [ParOldGen: 2796423K->2796423K(2796544K)] 3262343K->3262343K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5922608 secs] [Times: user=11.29 sys=0.00, real=1.59 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465919K(931840K)] [ParOldGen: 2796423K->2796423K(2796544K)] 3262343K->3262343K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5766683 secs] [Times: user=11.18 sys=0.02, real=1.57 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465920K(931840K)] [ParOldGen: 2796425K->2796425K(2796544K)] 3262345K->3262345K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5808111 secs] [Times: user=11.21 sys=0.00, real=1.58 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465920K(931840K)] [ParOldGen: 2796427K->2796427K(2796544K)] 3262347K->3262347K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6686178 secs] [Times: user=11.79 sys=0.01, real=1.68 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465920K(931840K)] [ParOldGen: 2796429K->2796429K(2796544K)] 3262349K->3262349K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5804707 secs] [Times: user=11.21 sys=0.01, real=1.58 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465920K(931840K)] [ParOldGen: 2796431K->2796431K(2796544K)] 3262351K->3262351K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.5736846 secs] [Times: user=11.17 sys=0.01, real=1.57 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465920K(931840K)] [ParOldGen: 2796433K->2796433K(2796544K)] 3262353K->3262353K(3728384K), [Metaspace: 70019K->70019K(1112064K)], 1.6448586 secs] [Times: user=11.74 sys=0.00, real=1.65 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465903K(931840K)] [ParOldGen: 2796435K->2796435K(2796544K)] 3262355K->3262339K(3728384K), [Metaspace: 70019K->70015K(1112064K)], 1.5992428 secs] [Times: user=11.32 sys=0.00, real=1.59 secs]
java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid14450.hprof ...
Heap dump file created [5638377967 bytes in 32.001 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->465904K(931840K)] [ParOldGen: 2796435K->2796434K(2796544K)] 3262355K->3262338K(3728384K), [Metaspace: 70015K->70015K(1112064K)], 1.7980229 secs] [Times: user=12.93 sys=0.03, real=1.79 secs]
[Full GC (Ergonomics) [PSYoungGen: 465920K->0K(931840K)] [ParOldGen: 2796434K->9739K(1840128K)] 3262354K->9739K(2771968K), [Metaspace: 70015K->70015K(1112064K)], 0.0714630 secs] [Times: user=0.14 sys=0.03, real=0.07 secs]
cts-tf >
10-31 11:30:00 E/CommandScheduler: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap.newNode(HashMap.java:1747)
at java.util.HashMap.putVal(HashMap.java:631)
at java.util.HashMap.put(HashMap.java:612)
at java.util.HashSet.add(HashSet.java:220)
at java.util.AbstractCollection.addAll(AbstractCollection.java:344)
at com.android.tradefed.config.OptionSetter.setFieldValue(OptionSetter.java:452)
at com.android.tradefed.config.OptionSetter.setFieldValue(OptionSetter.java:549)
at com.android.tradefed.config.OptionCopier.copyOptions(OptionCopier.java:49)
at com.android.tradefed.config.OptionCopier.copyOptionsNoThrow(OptionCopier.java:60)
at com.android.tradefed.testtype.suite.ITestSuite.split(ITestSuite.java:662)
at com.android.compatibility.common.tradefed.testtype.retry.RetryFactoryTest.split(RetryFactoryTest.java:184)
at com.android.tradefed.invoker.shard.ShardHelper.shardTest(ShardHelper.java:230)
at com.android.tradefed.invoker.shard.ShardHelper.shardConfig(ShardHelper.java:86)
at com.android.tradefed.invoker.shard.StrictShardHelper.shardConfig(StrictShardHelper.java:51)
at com.android.tradefed.invoker.InvocationExecution.shardConfig(InvocationExecution.java:149)
at com.android.tradefed.invoker.TestInvocation.invoke(TestInvocation.java:656)
at com.android.tradefed.command.CommandScheduler$InvocationThread.run(CommandScheduler.java:566)

GC overhead limt exceed检查是Hotspot VM 1.6定义的一个策略,通过统计GC时间来预测是否要OOM了,提前抛出异常,防止OOM发生。Sun 官方对此的定义是:“并行/并发回收器在GC回收时间过长时会抛出OutOfMemroyError。过长的定义是,超过98%的时间用来做GC并且回收了不到2%的堆内存。用来避免内存过小造成应用不能正常工作。“

而我们发现在抛出异常之前,确实频繁出现了full gc的日志,下面选取两个典型日志看一下含义:

[GC (Allocation Failure) [PSYoungGen: 1350980K->26096K(1204736K)] 1415587K->304733K(1483776K), 0.1134238 secs] [Times: user=0.70 sys=0.10, real=0.11 secs]
这个属于minor gc的日志:

  1. GC表示这是一次Minor GC(新生代垃圾收集)

2.[PSYoungGen: 1350980K->26096K(1204736K)] PSYoungGen表示新生代使用的是多线程垃圾收集器Parallel Scavenge,1350980K为gc前新生代占用空间的大小,26096K表示gc后占用空间的大小,
1204736K为新生代总大小
注:HotSpot JVM把新生代分为了三部分:1个Eden区和2个Survivor区(分别叫from和to)。默认比例为8:1,为啥默认会是这个比例,接下来我们会聊到。一般情况下,新创建的对象都会被分配到Eden区(一些大对象特殊处理),这些对象经过第一次Minor GC后,如果仍然存活,将会被移到Survivor区。对象在Survivor区中每熬过一次Minor GC,年龄就会增加1岁,当它的年龄增加到一定程度时,就会被移动到年老代中。

3.1415587K->304733K(1483776K) 1415587K: gc前堆占用总空间 304733K: gc后堆占用总空间 1483776K:当前堆的总大小
由()中的两个值可以算出老年代的大小,如本例中为: 1483776K - 1204736K

4.0.1134238 secs 本次gc消耗时间

  1. [Times: user=0.70 sys=0.10, real=0.11 secs]

user- 表示在此次垃圾回收过程中,所有GC线程消耗的CPU时间之和
sys - 表示GC过程中操作系统调用和系统等等事件所消耗的时间
real - 应用暂停的总时间。在并行GC中,这个数值应该接近于(user + sys) / GC线程数。即单个核上平均的暂停时间 。由于某些过程是不能并行执行的,所以这个值会比刚才求的均值略高

[Full GC (Ergonomics) [PSYoungGen: 465920K->465903K(931840K)] [ParOldGen: 2796435K->2796435K(2796544K)] 3262355K->3262339K(3728384K), [Metaspace: 70019K->70015K(1112064K)], 1.5992428 secs] [Times: user=11.32 sys=0.00, real=1.59 secs]

代表full gc,大体的格式类同,以三个数字为一组,分别代表新生代,老年代,总堆以及元数据区;从上面的log可以看出,报错前的full gc耗时1.5s,但并没有回收多少空间,最后导致错误:

问题分析

从log中我们可以看到这一行log,Dumping heap to java_pid14450.hprof ... 那么就可以用MAT分析得到的hprof文件

OverView.png

Dominator Tree.png

Histogram.png

Path To GC Roots -> exclude weak/soft references.png

从上面的图中可以看到CommandScheduler中的InvocationThread中的run方法中持有大量的局部变量会导致最终的gc overhead limit exceceeded.

注意: Histogram是从类的角度来总结对上内存分配状况,Dominator Tree可以是从实例的角度来总结,Path To GC Roots 可以看到引用路径,导致对象没有被gc的原因

大致的引用路径如下:

496    private class InvocationThread extends Thread {
           ...
514
515        public InvocationThread(String name, IInvocationContext invocationContext,
516                ExecutableCommand command, IScheduledInvocationListener... listeners) {
517            // create a thread group so LoggerRegistry can identify this as an invocationThread
518            super(new ThreadGroup(name), name);
519            mListeners = listeners;
520            mInvocationContext = invocationContext;
521            mCmd = command;
522            mInvocation = createRunInstance();
523
524            // Daemon timer
525            mExecutionTimer = new Timer(true);
526            mInvocationThreadMonitor = new InvocationThreadMonitor(this);
527        }
528
529        public long getStartTime() {
530            return mStartTime;
531        }
532
533        @Override
534        public void run() {
                   ...
566                instance.invoke(mInvocationContext, config,
567                        new Rescheduler(mCmd.getCommandTracker()), mListeners);
596    @Override
597    public void invoke(
598            IInvocationContext context,
599            IConfiguration config,
600            IRescheduler rescheduler,
601            ITestInvocationListener... extraListeners)
602            throws DeviceNotAvailableException, Throwable {
603        List<ITestInvocationListener> allListeners =
604                new ArrayList<>(config.getTestInvocationListeners().size() + extraListeners.length);
605        allListeners.addAll(config.getTestInvocationListeners());
606        allListeners.addAll(Arrays.asList(extraListeners));
607        ITestInvocationListener listener =
608                new LogSaverResultForwarder(config.getLogSaver(), allListeners);
609        IInvocationExecution invocationPath =
610                createInvocationExec(config.getConfigurationDescription().shouldUseSandbox());
611
612        // Create the Guice scope
613        InvocationScope scope = getInvocationScope();
614        scope.enter();
615        // Seed our TF objects to the Guice scope
616        scope.seedConfiguration(config);
617        try {
618            mStatus = "fetching build";
619            config.getLogOutput().init();
620            getLogRegistry().registerLogger(config.getLogOutput());
621            for (String deviceName : context.getDeviceConfigNames()) {
622                context.getDevice(deviceName).clearLastConnectedWifiNetwork();
623                context.getDevice(deviceName)
624                        .setOptions(config.getDeviceConfigByName(deviceName).getDeviceOptions());
625                if (config.getDeviceConfigByName(deviceName)
626                        .getDeviceOptions()
627                        .isLogcatCaptureEnabled()) {
628                    if (!(context.getDevice(deviceName).getIDevice() instanceof StubDevice)) {
629                        context.getDevice(deviceName).startLogcat();
630                    }
631                }
632            }
633
634            String cmdLineArgs = config.getCommandLine();
635            if (cmdLineArgs != null) {
636                CLog.i("Invocation was started with cmd: %s", cmdLineArgs);
637            }
638
639            long start = System.currentTimeMillis();
640            boolean providerSuccess =
641                    invokeFetchBuild(context, config, rescheduler, listener, invocationPath);
642            long fetchBuildDuration = System.currentTimeMillis() - start;
643            context.addInvocationTimingMetric(IInvocationContext.TimingEvent.FETCH_BUILD,
644                    fetchBuildDuration);
645            CLog.d("Fetch build duration: %s", TimeUtil.formatElapsedTime(fetchBuildDuration));
646            if (!providerSuccess) {
647                return;
648            }
649
650            mStatus = "sharding";
651            boolean sharding = invocationPath.shardConfig(config, context, rescheduler);
               ...
}
196    /**
197     * Attempt to shard given {@link IRemoteTest}.
198     *
199     * @param shardableTests the list of {@link IRemoteTest}s to add to
200     * @param test the {@link IRemoteTest} to shard
201     * @param shardCount attempted number of shard, can be null.
202     * @param context the {@link IInvocationContext} of the current invocation.
203     * @return <code>true</code> if test was sharded
204     */
205    private static boolean shardTest(
206            List<IRemoteTest> shardableTests,
207            IRemoteTest test,
208            Integer shardCount,
209            IInvocationContext context) {
210        boolean isSharded = false;
211        if (test instanceof IShardableTest) {
212            // inject device and build since they might be required to shard.
213            if (test instanceof IBuildReceiver) {
214                ((IBuildReceiver) test).setBuild(context.getBuildInfos().get(0));
215            }
216            if (test instanceof IDeviceTest) {
217                ((IDeviceTest) test).setDevice(context.getDevices().get(0));
218            }
219            if (test instanceof IMultiDeviceTest) {
220                ((IMultiDeviceTest) test).setDeviceInfos(context.getDeviceBuildMap());
221            }
222            if (test instanceof IInvocationContextReceiver) {
223                ((IInvocationContextReceiver) test).setInvocationContext(context);
224            }
225
226            IShardableTest shardableTest = (IShardableTest) test;
227            Collection<IRemoteTest> shards = null;
228            // Give the shardCount hint to tests if they need it.
229            if (shardCount != null) {
230                shards = shardableTest.split(shardCount);
231            } else {
232                shards = shardableTest.split();
233            }
234            if (shards != null) {
235                shardableTests.addAll(shards);
236                isSharded = true;
237            }
238        }
239        if (!isSharded) {
240            shardableTests.add(test);
241        }
242        return isSharded;
243    }
180    @Override
181    public Collection<IRemoteTest> split(int shardCountHint) {
182        try {
183            CompatibilityTestSuite test = loadSuite();
184            return test.split(shardCountHint);
185        } catch (DeviceNotAvailableException e) {
186            CLog.e("Failed to shard the retry run.");
187            CLog.e(e);
188        }
189        return null;
190    }
621    /** {@inheritDoc} */
622    @Override
623    public Collection<IRemoteTest> split(int shardCountHint) {
624        if (shardCountHint <= 1 || mIsSharded) {
625            // cannot shard or already sharded
626            return null;
627        }
628
629        LinkedHashMap<String, IConfiguration> runConfig = loadAndFilter();
630        if (runConfig.isEmpty()) {
631            CLog.i("No config were loaded. Nothing to run.");
632            return null;
633        }
634        injectInfo(runConfig);
635
636        // We split individual tests on double the shardCountHint to provide better average.
637        // The test pool mechanism prevent this from creating too much overhead.
638        List<ModuleDefinition> splitModules =
639                ModuleSplitter.splitConfiguration(
640                        runConfig, shardCountHint, mShouldMakeDynamicModule);
641        runConfig.clear();
642        runConfig = null;
643        // create an association of one ITestSuite <=> one ModuleDefinition as the smallest
644        // execution unit supported.
645        List<IRemoteTest> splitTests = new ArrayList<>();
646        for (ModuleDefinition m : splitModules) {
647            ITestSuite suite = createInstance();
648            OptionCopier.copyOptionsNoThrow(this, suite);
649            suite.mIsSharded = true;
650            suite.mDirectModule = m;
651            splitTests.add(suite);
652        }
653        // return the list of ITestSuite with their ModuleDefinition assigned
654        return splitTests;
655    }

最终导致持有大量的CompatibilityTestSuite,导致case fail不能6台case一起retry

问题总结

因为是属于google的cts测试框架问题,所以先向google反馈;临时的解决方案,减少retry机器台数,将-Xmx4g改为-Xmx8g

java $RDBG_FLAG -Xmx8g -XX:+HeapDumpOnOutOfMemoryError  -cp ${JAR_PATH} -DCTS_ROOT=${CTS_ROOT} com.android.compatibility.common.tradefed.command.CompatibilityConsole "$@"

环境配置

在上面的临时修改之后,可以在运行时抓一下hprof文件与gc overhead时产生的文件进行对比,但是可能需要一些环境配置,这里记录下我遇到的一些问题,如要查看的是pid = 8154

sudo jmap -heap 8154

结果如下:

Attaching to process ID 8154, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.181-b13
using thread-local object allocation.
Parallel GC with 8 thread(s)
Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 4294967296 (4096.0MB)
   NewSize                  = 87031808 (83.0MB)
   MaxNewSize               = 1431306240 (1365.0MB)
   OldSize                  = 175112192 (167.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)
Heap Usage:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jmap.JMap.runTool(JMap.java:201)
    at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.RuntimeException: unknown CollectedHeap type : class sun.jvm.hotspot.gc_interface.CollectedHeap
    at sun.jvm.hotspot.tools.HeapSummary.run(HeapSummary.java:144)
    at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
    ... 6 more

因为这里用到了sun.jvm下的debug包,默认openJDK的安装中不包含,要么装个oracle的jdk,要么下载一个openjdk的debuginfo,这里安装正式jdk

1.官网下载压缩文件

  1. 手动解压配置

sudo update-alternatives --install /usr/bin/java java /home/weijuncheng/java8/jdk1.8.0_191/bin/java 700
sudo update-alternatives --install /usr/bin/javac javac /home/weijuncheng/java8/jdk1.8.0_191/bin/javac 700
sudo update-alternatives --install /usr/bin/jar jar /home/weijuncheng/java8/jdk1.8.0_191/bin/jar 700

  1. sudo update-alternatives --config java 切换到jdk 1.8

然后又报错了

Error attaching to process: sun.jvm.hotspot.runtime.VMVersionMismatchException: Supported versions are 25.181-b13. Target VM is 25.191-b12
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.runtime.VMVersionMismatchException: Supported versions are 25.181-b13. Target VM is 25.191-b12
    at sun.jvm.hotspot.HotSpotAgent.setupVM(HotSpotAgent.java:435)
    at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:305)
    at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jmap.JMap.runTool(JMap.java:201)
    at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: sun.jvm.hotspot.runtime.VMVersionMismatchException: Supported versions are 25.181-b13. Target VM is 25.191-b12
    at sun.jvm.hotspot.runtime.VM.checkVMVersion(VM.java:227)
    at sun.jvm.hotspot.runtime.VM.<init>(VM.java:294)
    at sun.jvm.hotspot.runtime.VM.initialize(VM.java:370)
    at sun.jvm.hotspot.HotSpotAgent.setupVM(HotSpotAgent.java:431)

原因:
cd /etc/alternatives

lrwxrwxrwx 1 root root 45 11月 1 14:41 java -> /home/weijuncheng/java8/jdk1.8.0_191/bin/java
lrwxrwxrwx 1 root root 42 1月 22 2018 jmap -> /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap

两者不一样,所以会报版本不同的错误

sudo update-alternatives --config jmap
选择jdk 1.8

再执行,则没有报错了

最后jmap -dump:live,format=b,file=heap.hprof pid即可为相应pid生成hprof文件

推荐阅读更多精彩内容