The CMS Garbage Collection Algorithm

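Foundation: tri-color marking

Concurrent collection algorithms divide objects into three classes. Every object starts out white; marking then colors objects gray and black.

White object: an object that has not been marked. At reclamation time it is treated as dead and collected.
Gray object: the object itself is marked, but the objects referenced by its reference fields have not been marked yet (they are due to be marked but have not been processed).
Black object: the object itself is marked, and so are all objects referenced by its reference fields. A black object therefore references only black or gray objects, never a white one. The marker maintains this invariant on its own, but a concurrent mutator can break it.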
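The three colors and the "black never points to white" rule can be sketched in a few lines. This is only an illustration under simplifying assumptions: the names `Color`, `Obj`, `scan` and `tricolor_invariant_holds` are made up for this sketch, and objects are addressed by their index in a vector rather than by heap address.

```cpp
#include <cassert>
#include <vector>

enum class Color { White, Gray, Black };

struct Obj {
    Color color = Color::White;  // every object starts out white
    std::vector<int> refs;       // indices of the objects this one references
};

// Returns true if no black object points directly at a white object.
bool tricolor_invariant_holds(const std::vector<Obj>& heap) {
    for (const Obj& o : heap) {
        if (o.color != Color::Black) continue;
        for (int r : o.refs) {
            if (heap[r].color == Color::White) return false;
        }
    }
    return true;
}

// One marking step: scan a gray object, shade its white referents gray,
// then blacken the object itself.
void scan(std::vector<Obj>& heap, int i) {
    assert(heap[i].color == Color::Gray);
    for (int r : heap[i].refs) {
        if (heap[r].color == Color::White) heap[r].color = Color::Gray;
    }
    heap[i].color = Color::Black;
}
```

After `scan` the invariant holds by construction; it stops holding only if something outside the marker (the mutator) stores a reference to a white object into an object that is already black.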

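1. Summary

The objects marked live during the concurrent marking phase are live relative to the state at the first STW pause (initial mark). While marking runs concurrently, mutator activity can cause live objects to be missed.

A live object is missed when a black object comes to reference a white object that a gray object used to reference, and that white object loses every direct or indirect reference from gray objects.

In the final marking phase, the collector looks for this situation in every card whose reference fields were modified during concurrent marking (HotSpot's approach is to find the already-marked objects on these dirty cards, rescan their reference fields, mark any referent that is still white, and continue marking from there).

The dirty-card information for the old generation has to be assembled from both the card table and the mod union table.

The young generation and the GC roots can also reference such white objects. Because references in the young generation and the GC roots usually change at a high rate, recording dirty information for them is not worthwhile, so the final marking phase simply walks the GC roots again and traverses the young generation (this does not repeat the earlier marking work: black objects are not scanned again, they are skipped).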

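2. Details

This section largely follows the CMS paper (if you cannot download it, use the Baidu Netdisk link at the end). The long English passages are quoted from the paper; the rest is commentary.

2.1 The four main phases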

Initial marking pause. Suspend all mutators and record all objects directly reachable from the roots (globals, stacks, registers) of the system.
Concurrent marking phase. Resume mutator operation. At the same time, initiate a concurrent marking phase, which marks a transitive closure of reachable objects. This closure is not guaranteed to contain all objects reachable at the end of marking, since concurrent updates of reference fields by the mutator may have prevented the marking phase from reaching some live objects. To deal with this complication, the algorithm also arranges to keep track of updates to reference fields in heap objects. This is the only interaction between the mutator and the collector.
Final marking pause. Suspend the mutators once again, and complete the marking phase by marking from the roots, considering modified reference fields in marked objects as additional roots. Since such fields contain the only references that the concurrent marking phase may not have observed, this ensures that the final transitive closure includes all objects reachable at the start of the final marking phase. It may also include some objects that became unreachable after they were marked. These will be collected during the next garbage collection cycle.
Concurrent sweeping phase. Resume the mutators once again, and sweep concurrently over the heap, deallocating unmarked objects. Care must be taken not to deallocate newly-allocated objects. This can be accomplished by allocating objects “live” (i.e., marked), at least during this phase.
Note on the final marking pause: the objects that "became unreachable after they were marked" are what is usually called floating garbage. They survive the current cycle and are reclaimed in the next one.

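In the Java context, "mutator" originally refers to code that modifies the reference fields of objects: reading a reference field does not affect concurrent marking in CMS, but writing one does. Here, mutator can simply be read as the application (user) threads.

In the context of CMS initial mark, the root set does not include the young gen; it contains only the usual stacks, registers, and globals. That is because the following CMS concurrent mark phase traverses the live objects in the young gen by following the initial root set. Seen as initial mark + concurrent mark combined, the young gen is still part of the root set (it is scanned but not collected).

The job of the CMS final marking phase is to ensure that every object live in the current STW pause is marked. It therefore rescans the root set (globals, stacks, registers) and also rescans, as additional roots, the references in the old gen that changed according to the card table and the mod-union table.

The scenario the additional roots are meant to handle is:
A black object references a white object that a gray object used to reference (so the white object is live and must not be collected), and the white object loses every direct or indirect reference from gray objects. Marking does not revisit black objects, so without the additional roots the white object would never be marked, and it and the white objects reachable only through it would be reclaimed.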
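The scenario can be replayed on a three-object heap. The sketch below is a simplification, not HotSpot code: `Heap`, `store`, `mark_from` and `remark` are hypothetical names, modified objects are remembered individually in a set instead of per card, and `-1` stands for a null reference.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Node {
    bool marked = false;
    std::vector<int> refs;  // -1 means null
};

struct Heap {
    std::vector<Node> objs;
    std::set<int> dirty;  // objects whose reference fields were written during marking

    // Mutator store plus a barrier that remembers the object that was written to.
    void store(int from, std::size_t slot, int to) {
        assert(slot < objs[from].refs.size());
        objs[from].refs[slot] = to;
        dirty.insert(from);
    }
};

// Marks `start` and everything reachable from it; marked objects are never rescanned.
void mark_from(Heap& h, int start) {
    if (start < 0 || h.objs[start].marked) return;
    std::vector<int> stack{start};
    h.objs[start].marked = true;
    while (!stack.empty()) {
        int cur = stack.back();
        stack.pop_back();
        for (int r : h.objs[cur].refs) {
            if (r >= 0 && !h.objs[r].marked) {
                h.objs[r].marked = true;
                stack.push_back(r);
            }
        }
    }
}

// Final-mark step: reference fields of marked, modified objects act as extra roots.
void remark(Heap& h) {
    for (int d : h.dirty) {
        if (!h.objs[d].marked) continue;
        for (int r : h.objs[d].refs) mark_from(h, r);
    }
    h.dirty.clear();
}
```

With A black, B gray and C white, the mutator executes `A.f = C; B.f = null`. Scanning B afterwards finds nothing, so C stays unmarked until `remark` rescans A because it was recorded as modified.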

2.2 Card table and mod union table

Generational garbage collection requires tracking of references from objects in older generations to objects in younger generations. This is necessary for correctness, since some young-generation objects may be unreachable except through such references. A better scheme than simply traversing the entire older generation is required, since that would make the work of a young-generation collection similar to the work of a collection of the entire heap.
Several schemes for tracking such old-to-young references have been used, with different cost/accuracy tradeoffs. The generational framework of the ResearchVM (see Section 2) uses a card table for this tracking [34, 17, 33]. A card table is an array of values, each entry corresponding to a subregion of the heap called a card. The system is arranged so that each update of a reference field within a heap object by mutator code executes a write barrier that sets the card table entry corresponding to the card containing the reference field to a dirty value. In compiled mutator code, the extra code for card table update can be quite efficient: a two-instruction write barrier proposed by Hölzle [16] is used.
Notes: ResearchVM is the VM on which the paper's experiments were run. In Java terms, mutator code is code that modifies the instance fields of objects.

A note on write barriers:
Wikipedia: "A write barrier in a garbage collector is a fragment of code emitted by the compiler immediately before every store operation to ensure that (e.g.) generational invariants are maintained."

The CMS write barrier is very simple: it only records, in the card table, the card that contains the source end of the changed reference.

In fact, the post-write barrier that HotSpot VM normally uses is equally simple. It unconditionally records the card in which a reference changed:

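// Sketch of the unconditional post-write barrier (pseudocode, not compilable on its own).
void post_write_barrier(oop* field, oop val) {
    jbyte* card_ptr = card_for(field);  // card table entry covering the field's address
    *card_ptr = dirty_card;             // mark it dirty; val is deliberately ignored
}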

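The barrier cares neither about the generation that field lives in nor about the value of val, so whenever a reference changes, its card is recorded. In other words, the card table records not only old -> young references but the source end of every changed reference, whether it is in old or young.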
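To make the address-to-card mapping concrete, here is a minimal self-contained sketch. It assumes 512-byte cards (a shift of 9) and uses illustrative byte values for clean and dirty; the class `CardTable` and its members are invented for this example and are not HotSpot's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned kCardShift = 9;     // assumption: 2^9 = 512-byte cards
constexpr std::uint8_t kClean = 0xFF;  // illustrative encodings
constexpr std::uint8_t kDirty = 0x00;

class CardTable {
public:
    CardTable(std::uintptr_t heap_base, std::size_t heap_bytes)
        : base_(heap_base),
          cards_((heap_bytes + (std::size_t{1} << kCardShift) - 1) >> kCardShift,
                 kClean) {}

    // Index of the card that covers the given field address.
    std::size_t index_for(std::uintptr_t field_addr) const {
        assert(field_addr >= base_);
        return (field_addr - base_) >> kCardShift;
    }

    // Unconditional post-write barrier: neither the stored value nor the
    // generation of the field is inspected.
    void post_write_barrier(std::uintptr_t field_addr) {
        cards_[index_for(field_addr)] = kDirty;
    }

    bool is_dirty(std::size_t card) const { return cards_[card] == kDirty; }
    std::size_t size() const { return cards_.size(); }

private:
    std::uintptr_t base_;
    std::vector<std::uint8_t> cards_;  // one byte per card, so a byte store suffices
};
```

Because the barrier is a shift plus a byte store with no branches, it stays cheap in compiled code; the price is that the table over-approximates, recording every modified card rather than only those holding old-to-young references.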

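However, CMS in HotSpot VM uses only the old-gen part of the card table, that is, it only cares about old -> ? references. The usual reasoning is that the reference mutation rate of the young gen is very high, so most of its part of the card table is likely to be dirty; when the young gen has to be treated as a root, scanning the whole young gen directly is better than scanning the card table.

Conclusion: the card table is mainly used to track old-to-young references.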

Adapting the card table for the needs of the generational mostly-concurrent algorithm was straightforward. In fact, as discussed above, the write barrier and card table data structure were left unchanged. However, we took careful note of the fact that the card table is used in subtly different ways by two garbage collection algorithms that may be running simultaneously. The mostly-concurrent algorithm requires tracking of all references updated since the beginning of the current marking phase. Young-generation collection requires identification of all old-to-young pointers. In the base generational system, a young-generation collection scans all dirty old-space cards, searching for pointers into the young generation. If none are found, there is no need to scan this card in the next collection, so the card is marked as clean. Before a young-generation collection cleans a dirty card, the information that the card has been modified must be recorded for the mostly-concurrent collector.
Note: the card table was not designed for CMS in the first place; it was adapted to it. "Clean" here is simply the opposite of dirty.

[Figure 2 from the paper: the mod union table (image not reproduced)]

This is accomplished by adding a new data structure, the mod union table, shown in Figure 2, which is so-named because it represents the union of the sets of cards modified between each of the young-generation collections that occur during concurrent marking. The card table itself contains a byte per card in the ResearchVM; this allows a fast write-barrier implementation using a byte store. The mod-union table, on the other hand, is a bit vector with one bit per card. It therefore adds little space overhead beyond the card table, and also enables fast traversal to find modified cards when the table is sparsely populated. We maintain an invariant on the mod union and card tables: any card containing a reference modified since the beginning of the current concurrent marking phase either has its bit set in the mod union table, or is marked dirty in the card table, or both. This invariant is maintained by young-generation collections, which set the mod union bits for all cards dirty in the card table before scanning those dirty cards.

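Conclusion:
There is only one card table, and it has to serve both young GC and CMS. Every young GC resets and rescans the card table. That satisfies young GC but breaks what CMS needs: information CMS depends on may be reset by a young GC.

To avoid losing that information, a bitmap called the mod-union table is added alongside the card table. While CMS concurrent marking is running, whenever a young GC is about to reset an entry in the card table, it first sets the corresponding bit in the mod-union table.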
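The interplay of the two tables can be sketched as follows. This is a toy model under stated assumptions: `Tables`, `young_gc` and `modified_since_marking_began` are hypothetical names, cards are identified by index, and the young GC is told from outside which cards still hold old-to-young pointers instead of actually scanning them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tables {
    std::vector<std::uint8_t> card_dirty;  // card table: one byte per card, 1 = dirty
    std::vector<bool> mod_union;           // mod union table: one bit per card

    explicit Tables(std::size_t n)
        : card_dirty(n, std::uint8_t{0}), mod_union(n, false) {}

    void write_barrier(std::size_t card) { card_dirty[card] = 1; }

    // Young GC running while concurrent marking is in progress: every dirty card
    // is first recorded in the mod union table, then cleaned if it no longer
    // contains an old-to-young pointer.
    void young_gc(const std::vector<bool>& has_old_to_young) {
        assert(has_old_to_young.size() == card_dirty.size());
        for (std::size_t c = 0; c < card_dirty.size(); ++c) {
            if (card_dirty[c] == 0) continue;
            mod_union[c] = true;
            if (!has_old_to_young[c]) card_dirty[c] = 0;
        }
    }

    // The set final marking has to rescan is the union of both tables.
    bool modified_since_marking_began(std::size_t card) const {
        return card_dirty[card] != 0 || mod_union[card];
    }
};
```

Setting the mod union bit before the card can be cleaned is what preserves the invariant quoted above: a card modified during marking is always visible in at least one of the two tables.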

2.3 Marking objects

Our concurrent garbage collector uses an array of external mark bits. This bitmap contains one bit for every four-byte word in the heap. This use of external mark bits, rather than internal mark bits in object headers, prevents interference between mutator and collector use of the object headers.
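An external mark bitmap with one bit per four-byte heap word might look like the sketch below. The class name `MarkBitmap` and its layout (64-bit backing words) are assumptions made for illustration; only the one-bit-per-four-bytes ratio comes from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class MarkBitmap {
public:
    MarkBitmap(std::uintptr_t heap_base, std::size_t heap_bytes)
        : base_(heap_base),
          bits_((heap_bytes / 4 + 63) / 64, std::uint64_t{0}) {}

    void mark(std::uintptr_t addr) { bits_[word_of(addr)] |= mask_of(addr); }

    bool is_marked(std::uintptr_t addr) const {
        return (bits_[word_of(addr)] & mask_of(addr)) != 0;
    }

private:
    // One bit per 4-byte heap word.
    std::size_t bit_index(std::uintptr_t addr) const {
        assert(addr >= base_);
        return (addr - base_) / 4;
    }
    std::size_t word_of(std::uintptr_t addr) const { return bit_index(addr) / 64; }
    std::uint64_t mask_of(std::uintptr_t addr) const {
        return std::uint64_t{1} << (bit_index(addr) % 64);
    }

    std::uintptr_t base_;
    std::vector<std::uint64_t> bits_;  // lives outside the heap, object headers untouched
};
```

One bit per 32 bits of heap costs 1/32 of the heap size, about 3.1%, and the collector never has to write to object headers that the mutator may be using at the same time.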

Root scanning presents an interesting design choice, since it is influenced by two competing concerns. As we described in Section 3, the mostly-concurrent algorithm scans roots while the mutator is suspended. Therefore, we would like this process to be as fast as possible. On the other hand, any marking process requires some representation of the set of objects that have been marked but not yet scanned (henceforth the to-be-scanned set). Often this set is represented using some data structure external to the heap, such as a stack or queue. A strategy that minimizes stop-the-world time is simply to put all objects reachable from the roots in this external data structure. However, since garbage collection is intended to recover memory when that is a scarce resource, the sizes of such external data structures are always important concerns. Since the Java language is multi-threaded, the root set may include the registers and stack frames of many threads. In our generational system, objects in generations other than the one being collected are also considered roots. So the root set may indeed be quite large, arguing against this simple strategy.

An alternative strategy that minimizes space cost is one that marks all objects reachable from a root immediately on considering the root. Many objects may be reachable from roots, but we place such objects in the to-be-scanned set one at a time, minimising the space needed in this data structure (because of roots) at any given time. While suitable for non-concurrent collection, this strategy is incompatible with the mostly-concurrent algorithm, since it accomplishes all marking as part of the root scan.

We use a compromise between these two approaches. The compromise takes advantage of the use of an external marking bitmap. The root scan simply marks objects directly reachable from the roots. This minimizes the duration of the stop-the-world root scan, and imposes no additional space cost, by using the mark bit vector to represent the to-be-scanned set. The concurrent marking phase, then, consists of a linear traversal of the generation, searching the mark bit vector for live objects. (This process has cost proportional to the heap size rather than amount of live data, but the overall algorithm already has that complexity because of the sweeping phase). For every live object cur found, we push cur on a to-be-scanned stack, and then enter a loop that pops objects from this stack and scans their references, until the stack is empty. The scanning process for a reference value ref (into the mostly-concurrent generation) works as follows:

if ref points ahead of cur, the corresponding object is simply marked, without being pushed on the stack; it will be visited later in the linear traversal.
if ref points behind cur, the corresponding object is both marked and pushed on the stack.
Here "ahead of cur" means at a higher address, in the part of the generation the linear traversal has not reached yet; "behind cur" means at a lower address, which the traversal has already passed.

[Figure 3 from the paper: example of the marking traversal (image not reproduced)]

Figure 3 illustrates this process. The marking traversal has just discovered a marked object c*, whose address becomes the value of cur. Scanning c* finds two outgoing references, to a and e. Object e is simply marked, since its address follows cur. Object a is before cur, so it is both marked and scanned. This leads to b, which is also before cur, so it too is marked and scanned. Object b’s reference to d, however, only causes d to be marked, since it follows cur, and will therefore be scanned later in the traversal.
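The traversal described in the paper can be sketched as follows. This is a simplified model, not the collector's code: objects are identified by their position in address order, the mark bit vector is a `std::vector<bool>`, and the function name `concurrent_mark` and its return value (the order in which objects are scanned) are choices made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Object {
    std::vector<std::size_t> refs;  // positions (in address order) of referenced objects
};

// `marked` holds the result of the root scan on entry and is updated in place.
// Returns the order in which objects are scanned.
std::vector<std::size_t> concurrent_mark(const std::vector<Object>& gen,
                                         std::vector<bool>& marked) {
    assert(marked.size() == gen.size());
    std::vector<std::size_t> scan_order;
    std::vector<std::size_t> stack;  // the to-be-scanned stack
    for (std::size_t cur = 0; cur < gen.size(); ++cur) {  // linear traversal
        if (!marked[cur]) continue;
        stack.push_back(cur);
        while (!stack.empty()) {
            std::size_t obj = stack.back();
            stack.pop_back();
            scan_order.push_back(obj);
            for (std::size_t ref : gen[obj].refs) {
                if (marked[ref]) continue;
                marked[ref] = true;
                // Behind cur: the traversal has already passed it, so push it.
                // Ahead of cur: the mark bit is enough, the traversal will reach it.
                if (ref < cur) stack.push_back(ref);
            }
        }
    }
    return scan_order;
}
```

With the layout of Figure 3 (a, b, c, d, e at positions 0 to 4, only c marked by the root scan, and references c -> a, c -> e, a -> b, b -> d), the objects are scanned in the order c, a, b, d, e, and all five end up marked.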

This technique reduces demand on the to-be-scanned stack, since no more than one object directly reachable from the root set is ever on the stack. A potential disadvantage of this approach is the linear traversal searching for live objects, which makes the algorithmic complexity of marking contain a component proportional to the size of the generation, rather than just the number of nodes and edges in the pointer graph. This is a practical difficulty only if the cost of searching for marked objects outweighs the cost of scanning them if when found, which will occur only if live objects are sparse. Note that if live objects are sparse, the use of a bitmap allows large regions without live objects to be skipped efficiently, by detecting zero words in the bit vector. A similar technique was used by Printezis [25] in the context of a disk garbage collector.

References:
CMS paper (Baidu Netdisk link)
 https://hllvm-group.iteye.com/group/topic/44529

If the marking-objects part of the paper is hard to follow, see https://zhuanlan.zhihu.com/p/100709946