Cgroup - Memory Subsystem (Memory Resource Controller)

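Author's preface

This article is a translation of the kernel document "Memory Resource Controller".

Although it is official documentation, it contains surprisingly little practical information, so I recommend reading this blog as well. Once I have dug deeper, I plan to write a series of articles on Linux memory management and try to make sense of what is reputed to be the most complex module in Linux.

Foreword

This document is outdated and needs a complete rewrite. It still contains useful information, so it has some value as a reference; for a deeper understanding, check the current kernel code.

The Memory Resource Controller is generally referred to as the memory controller in this document. Do not confuse the memory controller used here with the memory controller used in hardware.

(For editors) In this document:
When we mention a cgroup (a cgroupfs directory) with the memory controller, we call it a "memory cgroup". In git-log and source code, patch titles and function names tend to use "memcg" instead. In this document, we avoid using it.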

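Benefits and purpose of the memory controller

The memory controller isolates the memory behaviour of a group of tasks from the rest of the system. The article "Controlling memory use in containers" mentions some probable uses of the memory controller. The memory controller can be used to:

Isolate an application or a group of applications. A memory-hungry application can be isolated and limited to a smaller amount of memory.
Create a cgroup with a limited amount of memory; this can be used as a good alternative to booting with mem=XXXX.
Control the amount of memory assigned to a virtual machine instance in virtualization solutions.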
A CD/DVD burner could control the amount of memory used by the rest of the system to ensure that burning does not fail due to lack of available memory.
There are several other use cases; find one or use the controller just for fun (to learn and hack on the VM subsystem).

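As of linux-2.6.34-mmotm (development version of April 2010), the features are:

Accounting and limiting the usage of anonymous pages, file caches and swap caches
Pages are linked exclusively to a per-memory-cgroup LRU list; there is no global LRU
Optionally, memory+swap usage can be accounted and limited
Hierarchical accounting
Soft limit
Moving (recharging) the account when a task is moved is selectable: the charges can either follow the task or stay behind
Usage threshold notifier
Memory pressure notifier
oom-killer disable knob and oom-notifier (OOM means out of memory, a serious condition that usually requires killing a process)
Root cgroup has no limit controls.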

Brief summary of control files:

File	Description
tasks	attach a task (thread) and show list of threads
cgroup.procs	show list of processes
cgroup.event_control	an interface for event_fd(), see eventfd(2)
memory.usage_in_bytes	show current usage for memory (See 5.5 for details)
memory.memsw.usage_in_bytes	show current usage for memory+Swap (See 5.5 for details)
memory.limit_in_bytes	set/show limit of memory usage (the hard limit)
memory.memsw.limit_in_bytes	set/show limit of memory+Swap usage
memory.failcnt	show the number of memory usage hits limits
memory.memsw.failcnt	show the number of memory+Swap hits limits
memory.max_usage_in_bytes	show max memory usage recorded
memory.memsw.max_usage_in_bytes	show max memory+Swap usage recorded
memory.soft_limit_in_bytes	set/show soft limit of memory usage
memory.stat	show various statistics
memory.use_hierarchy	set/show hierarchical account enabled. Deprecated and should not be configured
memory.force_empty	trigger forced page reclaim
memory.pressure_level	set memory pressure notifications
memory.swappiness	set/show swappiness parameter of vmscan (see sysctl's vm.swappiness; the higher the value, the more aggressively the kernel swaps, and the lower the value, the more it prefers to keep pages in physical memory)
memory.move_charge_at_immigrate	set/show controls of moving charges (the one entry I found hard to understand; see section 8)
memory.oom_control	set/show oom controls.
memory.numa_stat	show the number of memory usage per numa node
memory.kmem.limit_in_bytes	set/show hard limit for kernel memory. This knob is deprecated and shouldn't be used. It is planned that this be removed in the foreseeable future.
memory.kmem.usage_in_bytes	show current kernel memory allocation
memory.kmem.failcnt	show the number of kernel memory usage hits limits
memory.kmem.max_usage_in_bytes	show max kernel memory usage recorded
memory.kmem.tcp.limit_in_bytes	set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes	show current tcp buf memory allocation
memory.kmem.tcp.failcnt	show the number of tcp buf memory usage hits limits
memory.kmem.tcp.max_usage_in_bytes	show max tcp buf memory usage recorded

1. History

The memory controller has a long history. A request for comments for the memory controller was posted by Balbir Singh [1]. At the time the RFC was posted there were several implementations for memory control. The goal of the RFC was to build consensus and agreement for the minimal features required for memory control. The first RSS controller was posted by Balbir Singh[2] in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the RSS controller. At OLS, at the resource management BoF, everyone suggested that we handle both page cache and RSS together. Another request was raised to allow user space handling of OOM. The current memory controller is at version 6; it combines both mapped (RSS) and unmapped Page Cache Control [11].

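2. Memory Control

Memory is a unique resource in the sense that it is present in a limited amount. If a task requires a lot of CPU processing, the task can spread its processing over hours, days, months or years, but with memory, the same physical memory has to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These are: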

Memory controller
mlock(2) controller
Kernel user memory accounting and slab control
user mappings length controller

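The memory controller is the first controller that was developed.

2.1 Design

The core of the design is a counter called the page_counter. The page_counter tracks the current memory usage of the group of processes associated with the controller and enforces the memory limit on that basis. Each cgroup has a memory-controller-specific data structure (mem_cgroup) associated with it.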
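As a rough illustration of the idea (not the kernel's actual code), the counter can be modelled as a small class that charges pages against a limit and records failures. The class name PageCounter and its methods try_charge/uncharge are hypothetical names chosen for this sketch; the parent link anticipates the hierarchical accounting described in section 6.

```python
class PageCounter:
    """Toy model of a per-cgroup page counter (illustrative only)."""

    def __init__(self, limit=None, parent=None):
        self.usage = 0          # pages currently charged
        self.limit = limit      # maximum pages; None means unlimited
        self.failcnt = 0        # number of times a charge hit the limit
        self.max_usage = 0      # high-water mark of successful charges
        self.parent = parent    # ancestor counter, if any

    def try_charge(self, nr_pages):
        """Charge nr_pages here and in every ancestor, or charge nothing."""
        charged = []
        node = self
        while node is not None:
            if node.limit is not None and node.usage + nr_pages > node.limit:
                node.failcnt += 1
                for done in charged:          # roll back partial charges
                    done.usage -= nr_pages
                return False
            node.usage += nr_pages
            charged.append(node)
            node = node.parent
        for done in charged:
            done.max_usage = max(done.max_usage, done.usage)
        return True

    def uncharge(self, nr_pages):
        """Remove nr_pages from this counter and from every ancestor."""
        node = self
        while node is not None:
            node.usage -= nr_pages
            node = node.parent
```

In this model a failed try_charge is the point where the real controller would start reclaim; the sketch only counts the failure.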

2.2 Accounting

Figure 1: Hierarchy of Accounting

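Figure 1 shows the important aspects of the controller:

Accounting happens per cgroup: each cgroup maintains its own private mem_cgroup structure
Each mm_struct knows which cgroup it belongs to; the mm_struct is shared by all threads of a process
Each page has a pointer to the page_cgroup, which in turn knows the cgroup it belongs to

The accounting is done as follows:
Wherever a charge is needed, mem_cgroup_charge_common() is invoked to set up the necessary data structures (for example, to increase the memory count) and to check whether the cgroup being charged is over its limit. If it is, reclaim is invoked on the cgroup; if the cgroup is still over its limit after reclaim, the OOM or blocking-wait mechanism is triggered. If everything goes well, a page meta-data-structure called page_cgroup is updated. page_cgroup has its own LRU on the cgroup. (*) The page_cgroup structure is allocated at boot/memory-hotplug time.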

2.2.1 Accounting details

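All mapped anonymous pages (RSS) and cache pages (Page Cache) are accounted. Some pages that are never reclaimable and never appear on the LRU are not accounted. We only account pages under the usual VM management.

Anonymous pages are accounted at page fault time; page cache pages are accounted when they are inserted into the inode's radix-tree.

Correspondingly, an anonymous page is unaccounted when it is fully unmapped, and a page cache page is unaccounted when it is removed from the radix-tree. If an anonymous page is swapped out, however, it is unmapped but may still exist in the swap cache; in that case it is only unaccounted when the page is really freed. Likewise, an anonymous page that is swapped in is accounted once it has been added to the swap cache.

Note: the kernel does swap-in readahead, that is, it reads multiple pages at once. Since the page's memcg is recorded into swap whether or not memsw is enabled, the page will be accounted after swap-in.

At page migration, the accounting information is kept.

Note: only pages on the LRU are accounted, because our purpose is to control the number of used pages; from the VM's point of view, pages not on the LRU tend to be out of control.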

2.3 Shared Page Accounting

Shared pages are accounted on a first-touch basis: the page is charged to the memcg that first brings it in. The principle behind this approach is that a cgroup that aggressively uses a shared page will eventually get charged for it (once it is uncharged from the cgroup that brought it in; this will happen on memory pressure).

When a task is moved to another cgroup, its pages may be recharged to the new cgroup, depending on whether move_charge_at_immigrate has been set.

2.4 Swap Extension

The Swap Extension implements limiting and accounting of swap usage; each cgroup keeps track of its own swap usage.

When CONFIG_SWAP is enabled, the following two files are added:

memory.memsw.usage_in_bytes.
memory.memsw.limit_in_bytes.

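memsw means memory+swap; memory.memsw.limit_in_bytes limits its size.

Example: assume a system with 4GB of swap in total. A task that allocates 6GB of memory under a 2GB memory limit will use up all 4GB of swap. Setting memory.memsw.limit_in_bytes=3GB prevents swap from being exhausted in this way; by using the memsw limit you can avoid the OOM that a swap shortage would cause.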
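The arithmetic of this example can be sketched as follows. This is a deliberately simplified model (one cgroup, no page cache, no kernel memory, no global reclaim), and swap_footprint is a hypothetical helper name, not a kernel interface.

```python
GIB = 1 << 30

def swap_footprint(demand, mem_limit, swap_total, memsw_limit=None):
    """Return (resident, swapped, satisfied) in bytes for one cgroup."""
    resident = min(demand, mem_limit)          # what fits under the memory limit
    overflow = demand - resident               # what would have to go to swap
    swap_cap = swap_total
    if memsw_limit is not None:
        # memory+swap is capped, so swap may only cover the remainder
        swap_cap = min(swap_cap, max(memsw_limit - resident, 0))
    swapped = min(overflow, swap_cap)
    return resident, swapped, swapped == overflow
```

Without a memsw limit the 6GB request occupies 2GB of memory and all 4GB of swap; with a 3GB memsw limit it can take at most 1GB of swap, and the request fails inside the cgroup instead of draining swap for the whole system.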

why ‘memory+swap’ rather than swap

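The global LRU (kswapd) can swap out arbitrary pages. Swapping out moves a page's charge from memory to swap, so it does not change the usage of memory+swap. In other words, we want to limit swap usage without interfering with kswapd, which is why limiting memory+swap is the better choice.

What happens when a cgroup hits memory.memsw.limit_in_bytes
When a cgroup hits this limit, swapping out within it is clearly pointless, so the cgroup drops page cache instead to free memory. But as noted above, kswapd may still swap out any page to keep the system's memory state healthy; this cannot be forbidden from the cgroup.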

2.5 Reclaim

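Each cgroup maintains an LRU with the same structure as the global LRU. When a cgroup goes over its limit, it first tries to reclaim memory to make space; if reclaim fails, the OOM routine is invoked to kill the bulkiest task in the cgroup.

Note:

Reclaim does not work for the root cgroup, since no limit can be set on the root cgroup
When panic_on_oom is set to 2, the whole system will panic

When an oom event notifier is registered, the corresponding event will be delivered.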

2.6 Locking

Lock order is as follows:

Page lock (PG_locked bit of page->flags)
    mm->page_table_lock or split pte_lock
        lock_page_memcg (memcg->move_lock)
            mapping->i_pages lock
                lruvec->lru_lock.

Per-node-per-memcgroup LRU (cgroup’s private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)

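With the Kernel Memory Extension, the memory controller can limit the amount of kernel memory used. Kernel memory cannot be swapped out, which makes it possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. It can be disabled by passing cgroup.memory=nokmem at boot time.

Kernel memory limits are not imposed on the root cgroup. Kernel memory usage is recorded in memory.kmem.usage_in_bytes; some specific cases have their own separate counter, for example TCP, which uses memory.kmem.tcp.limit_in_bytes.

Kernel memory charges are also fed into the main memory counter; that is, when kernel memory takes a new page, the charge is visible in the main (user) counter as well.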

Currently no soft limit is implemented for kernel memory. It is future work to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted

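stack pages:
Every process consumes some stack pages. Accounting them as kernel memory prevents new processes from being created when kernel memory usage is too high.

slab pages:
Pages allocated by the SLAB or SLUB allocator are tracked. A copy of each kmem_cache is created the first time the cache is touched from inside the memcg. The creation is done lazily, so some objects can still be skipped while the cache is being created. All objects in a slab page should belong to the same memcg. This only fails to hold when a task is migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
Some socket protocols have memory pressure thresholds. The memory controller allows them to be controlled individually per cgroup, instead of globally.

tcp memory pressure:
Sockets memory pressure for the TCP protocol, which is one of the socket protocols mentioned above.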

2.7.2 Common use cases

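Because kernel memory is fed into the user counter, the two are always limited together. Say U is the user limit and K the kernel limit; there are three possible ways the limits can be set:
U != 0, K = unlimited:
This is the standard limiting mechanism that existed before kmem accounting. Kernel memory is completely ignored.

U != 0, K < U:
This corresponds to overcommitted deployments, where the sum of all cgroups' memory exceeds physical memory. Since kernel memory cannot be swapped, K can be set so that the kernel limits never add up to more than physical memory; as long as they stay below it, the problem remains manageable. U covers memory that can be swapped, so it can be set freely, at the cost of QoS.

WARNING: in the current implementation, memory reclaim is NOT triggered for a cgroup when it hits K while staying below U, which makes this setup impractical.

U != 0, K >= U:
Since kmem charges are also fed to the user counter, reclaim will be triggered for the cgroup for both kinds of memory. This setup gives the admin a unified view of memory, and it is also useful for people who just want to track kernel memory usage.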

3. User Interface

3.0 Configuration

Enable CONFIG_CGROUPS
Enable CONFIG_MEMCG
Enable CONFIG_MEMCG_SWAP (to use swap extension)
Enable CONFIG_MEMCG_KMEM (to use kmem extension)

3.1 Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)

# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it:

# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks

Now we are in the non-root cgroup "0" and can alter its memory limit:

# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

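Note:

We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, mega or giga. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
We can write -1 to reset the limit, which means unlimited
We cannot set limits on the root cgroup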

# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304

We can check the current usage:

# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512

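A successful write to this file does not guarantee that the limit was set to the value written. This can be due to a number of factors, such as rounding up to page boundaries or the total availability of memory on the system. The user is required to re-read this file after a write to see the value committed by the kernel: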

# echo 1 > memory.limit_in_bytes
# cat memory.limit_in_bytes
4096
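The suffix handling and page rounding described above can be mimicked with a small helper. This is only a sketch of the behaviour shown in the examples (4M becomes 4194304, 1 becomes 4096); parse_limit is a hypothetical name, the 4 KiB page size is an assumption taken from the output above, and the kernel's real parsing may differ between versions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as in the example output above

_SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}

def parse_limit(text, page_size=PAGE_SIZE):
    """Convert a limit string such as '4M' to bytes; None means unlimited."""
    text = text.strip()
    if text == "-1":
        return None                      # -1 resets the limit
    factor = 1
    if text and text[-1].lower() in _SUFFIXES:
        factor = _SUFFIXES[text[-1].lower()]
        text = text[:-1]
    value = int(text) * factor
    # Round up to a whole number of pages.
    return -(-value // page_size) * page_size
```

Re-reading memory.limit_in_bytes after a write remains the only reliable way to learn the value the kernel actually committed.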

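The memory.failcnt field gives the number of times that the cgroup limit was exceeded.

The memory.stat file gives accounting information. Currently it shows the amount of cache, RSS and active/inactive pages.

4. Testing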

For testing features and implementation, see Memory Resource Controller(Memcg) Implementation Memo.

Performance test is also important. To see pure memory controller’s overhead, testing on tmpfs will give you good numbers of small overheads. Example: do kernel make on tmpfs.

Page-fault scalability is also important. When measuring parallel page faults, a multi-process test may be better than a multi-thread test, because the latter has noise from shared objects/status.

But the above two are testing extreme situations. Trying usual test under memory controller is always helpful.

4.1 Troubleshooting

Sometimes a user might find that the application under a cgroup is terminated by the OOM killer. There are several causes for this:

The cgroup limit is too low (just too low to do anything useful)
The user is using anonymous memory and swap is turned off or too low

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of some of the pages cached in the cgroup (page cache pages).

To know what happens, disabling OOM_Kill as per “10. OOM Control” (below) and seeing what happens will be helpful.

4.2 Task migration

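When a task migrates from one cgroup to another, its page charges are not carried along by default. The pages remain charged to the original cgroup until they are freed or reclaimed.

4.3 Removing a cgroup

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a cgroup may still have some charge associated with it even though all tasks have migrated away from it. These charges are handed over to the parent cgroup, which in fact already accounts for them.

Charges recorded in swap information are not updated at removal of a cgroup. The recorded information is discarded, and a cgroup which uses the swap (swapcache) will be charged as its new owner.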

5. Misc. interfaces

5.1 force_empty

memory.force_empty is provided to empty a cgroup's memory usage. When anything is written to it:

# echo 0 > memory.force_empty

the cgroup will free as many pages as it can.

The typical use case is to call force_empty before rmdir(). Although rmdir() offlines the memcg, the memcg may still stay around because of charged file caches. Some page cache pages stay charged until memory pressure occurs.

Also, note that when memory.kmem.limit_in_bytes is set the charges due to kernel pages will still be seen. This is not considered a failure and the write will still return success. In this case, it is expected that memory.kmem.usage_in_bytes == memory.usage_in_bytes.

5.2 stat file

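The memory.stat file includes the following statistics:
per-memory cgroup local status

Name	Description
cache	# of bytes of page cache memory
rss	# of bytes of anonymous and swap cache memory (includes transparent hugepages)
rss_huge	# of bytes of anonymous transparent hugepages
mapped_file	# of bytes of mapped file (includes tmpfs/shmem)
pgpgin	# of charging events, i.e. the number of times a page was accounted to cache or rss (a page entering the cgroup)
pgpgout	# of uncharging events, i.e. the number of times a page was removed from cache or rss (a page leaving the cgroup). The difference from pgpgin should be the current usage in pages, yet in practice `(pgpgin - pgpgout)*4096 != cache + rss`; the values are close, and I am not sure where the difference comes from
swap	# of bytes of swap usage
dirty	# of bytes that are waiting to get written back to the disk
writeback	# of bytes of file/anon cache that are queued for syncing to disk
inactive_anon	# of bytes of anonymous and swap cache memory on the inactive LRU list
active_anon	# of bytes of anonymous and swap cache memory on the active LRU list
inactive_file	# of bytes of file-backed memory on the inactive LRU list
active_file	# of bytes of file-backed memory on the active LRU list
unevictable	# of bytes of memory that cannot be reclaimed (mlocked etc.)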

status considering hierarchy (see memory.use_hierarchy settings)

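Name	Description
hierarchical_memory_limit	# of bytes of memory limit with regard to the hierarchy under which the memory cgroup is
hierarchical_memsw_limit	# of bytes of memory+swap limit with regard to the hierarchy under which the memory cgroup is
total_<counter>	# hierarchical version of <counter>, which in addition to the cgroup's own value includes the sum of all hierarchical children's values of <counter>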

The following additional stats are dependent on CONFIG_DEBUG_VM

Name	Description
recent_rotated_anon	VM internal parameter. (see mm/vmscan.c)
recent_rotated_file	VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon	VM internal parameter. (see mm/vmscan.c)
recent_scanned_file	VM internal parameter. (see mm/vmscan.c)

Memo:
recent_rotated means the recent frequency of LRU rotation. recent_scanned means the recent # of scans of the LRU. These are shown for debugging; please see the code for their exact meanings.

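Note:
Only anonymous and swap cache memory is counted in rss. The resident set size of the cgroup (the memory it actually uses) should be rss + mapped_file. Why not rss + cache? Because cache consists of two parts, the page cache created by ordinary file access and the page cache created by mmap; cache includes all page cache.

mapped_file covers ordinary file mappings as well as shmem and the like; for shared pages, mapped_file only counts the part that belongs to the cgroup.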
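A minimal sketch of reading these counters from a program: memory.stat is a list of "name value" lines, so it parses into a dictionary directly. The function names and the sample numbers below are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse the 'name value' lines of memory.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def resident_set_bytes(stats):
    """rss + mapped_file, the resident set size as described in the note."""
    return stats["rss"] + stats["mapped_file"]

# Made-up sample content, not taken from a real system.
SAMPLE = """\
cache 811008
rss 405504
mapped_file 135168
"""
```

On a real system the text would come from something like open("/sys/fs/cgroup/memory/0/memory.stat").read(), using the cgroup created in section 3.2.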

5.3 swappiness

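The root cgroup's memory.swappiness corresponds to the global /proc/sys/vm/swappiness setting.

Unlike global reclaim, limit reclaim with swappiness set to 0 really does prevent any swapping, even if swap storage is available. This may trigger the OOM killer when there is not enough memory left to reclaim.

5.4 failcnt

memory.failcnt and memory.memsw.failcnt show the number of times a usage counter hit its limit. When a memory cgroup hits a limit, failcnt is incremented and memory in the cgroup is reclaimed.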
You can reset failcnt by writing 0 to failcnt file:

# echo 0 > .../memory.failcnt

5.5 usage_in_bytes

For efficiency, like other kernel components, the memory cgroup uses some optimizations to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by them and does not show the exact value of memory (and swap) usage; it is a fuzzy value that is cheap to access. (Of course, when necessary, it's synchronized.) If you want a more exact figure for a cgroup's memory usage, use RSS + CACHE (+ SWAP) from memory.stat.

5.6 numa_stat

This is similar to numa_maps (see numa_maps in the kernel documentation) but operates on a per-cgroup basis. This is useful for providing visibility into the numa locality information within a memcg, since the pages are allowed to be allocated from any physical node. One of the use cases is evaluating application performance by combining this information with the application's CPU allocation.

Each memcg’s numa_stat file includes “total”, “file”, “anon” and “unevictable” per-node page counts including “hierarchical_<counter>” which sums up all hierarchical children’s values in addition to the memcg’s own value.

The output format of memory.numa_stat is:

total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
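The line format above is regular enough to parse with a few string splits. The following sketch assumes exactly the layout shown (a name=total field followed by N<id>=<pages> fields); parse_numa_stat and the sample values are illustrative only.

```python
def parse_numa_stat(text):
    """Map each counter name to (total, {node_id: pages})."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, total = fields[0].split("=", 1)
        nodes = {}
        for field in fields[1:]:
            node, pages = field.split("=", 1)
            nodes[int(node[1:])] = int(pages)   # "N0" -> node id 0
        result[name] = (int(total), nodes)
    return result

# Made-up sample content for a two-node machine.
SAMPLE = "total=300 N0=200 N1=100\nfile=120 N0=100 N1=20\n"
```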

6. Hierarchy support

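The memory controller supports a deep hierarchy and hierarchical accounting. The hierarchy is created by creating the appropriate cgroups in the cgroup filesystem. Consider, for example, the following cgroup filesystem hierarchy: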

    root
  /  |   \
 /   |    \
a    b     c
           | \
           |  \
           d   e

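In the hierarchy above, with hierarchical accounting enabled, all memory usage of e is accounted to its ancestors, that is, c and root. When one of the ancestors runs short of memory, the reclaim algorithm reclaims from the tasks in the ancestor and in its children.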
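To make the accounting rule concrete, here is a small sketch that computes the hierarchical usage of every node in the tree above. The CHILDREN table mirrors the diagram; the function name and the per-cgroup page counts used in the usage example are invented for illustration.

```python
# Child lists mirroring the diagram: root has a, b, c; c has d, e.
CHILDREN = {
    "root": ["a", "b", "c"],
    "a": [], "b": [],
    "c": ["d", "e"],
    "d": [], "e": [],
}

def hierarchical_usage(local, children, node):
    """Local usage of node plus the usage of all of its descendants."""
    return local.get(node, 0) + sum(
        hierarchical_usage(local, children, child) for child in children[node]
    )
```

With local usage of, say, {"a": 10, "c": 5, "d": 7, "e": 3} pages, c is charged 15 pages and root 25, which is the same relationship that total_<counter> in memory.stat expresses.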

6.1 Hierarchical accounting and reclaim

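Hierarchical accounting is enabled by default. Disabling it is deprecated; an attempt to do so will fail and print a warning to dmesg.

For compatibility reasons, writing 1 to memory.use_hierarchy will always pass: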

# echo 1 > memory.use_hierarchy

7. Soft limits

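A cgroup may use more memory than its soft limit. The idea behind soft limits is to let control groups use as much memory as they need, provided:

They do not exceed their hard limit
There is no memory contention

When the system detects memory contention, control groups are pushed back to their soft limits. If a control group's soft limit is very high, it is pushed back as far as possible, so that it does not starve other control groups of memory.

Soft limits are a best-effort feature; they come with no guarantees, but do their best to make sure that when memory is heavily contended, it is allocated according to the soft limit settings. Soft-limit-based reclaim is invoked from balance_pgdat (kswapd).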

7.1 Interface

Soft limits can be setup by using the following commands (in this example we assume a soft limit of 256 MiB):

# echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use:

# echo 1G > memory.soft_limit_in_bytes

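NOTE:

Soft limits take effect over a long period of time, since they involve reclaiming memory and balancing between memory cgroups
The soft limit should be set below the hard limit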

8. Move charges at task migration

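Users can choose to move the page charges associated with a task along with the task when it migrates, from the old cgroup to the new one. This feature is not supported in !CONFIG_MMU environments because of lack of page tables.

8.1 Interface

Moving charges along with the task is disabled by default. It can be enabled or disabled through memory.move_charge_at_immigrate.

To enable it:

# echo (some positive value) > memory.move_charge_at_immigrate

Note:

Each bit of memory.move_charge_at_immigrate controls which type of charges should be moved, see 8.2
Charges are moved only when the mm->owner thread, in other words the thread-group leader, is moved, because the memory usage of all threads is charged to the cgroup of the thread-group leader
If the destination cgroup does not have enough space during the move, memory reclaim is triggered; if there is still not enough space after reclaim, the migration fails
Moving a large amount of charges can take a long time (on the order of seconds)

To disable it again: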

# echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved

Each bit in move_charge_at_immigrate has its own meaning about what type of charges should be moved. But in any case, it must be noted that an account of a page or a swap can be moved only when it is charged to the task’s current (old) memory cgroup.

bit	what type of charges would be moved ?
0	A charge of an anonymous page (or swap of it) used by the task. The Swap Extension must be enabled for swap charges to be moved
1	A charge of file pages mmapped by the task, including normal files, tmpfs files and swaps of tmpfs files. Unlike the case of anonymous pages, file pages (and swaps) in the range mmapped by the task will be moved even if the task hasn't done page fault, i.e. they might not be the task's "RSS", but other task's "RSS" that maps the same file. And mapcount of the page is ignored (the page can be moved even if page_mapcount(page) > 1).
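Since the value is a bitmask, the meaning of a given setting can be decoded with two bit tests. The constant and function names below are hypothetical labels for the two bits in the table, not kernel identifiers.

```python
MOVE_ANON = 1 << 0   # bit 0: anonymous pages and their swap
MOVE_FILE = 1 << 1   # bit 1: mmapped file pages (incl. tmpfs) and their swap

def describe_move_charge(value):
    """List the charge types selected by a move_charge_at_immigrate value."""
    kinds = []
    if value & MOVE_ANON:
        kinds.append("anon")
    if value & MOVE_FILE:
        kinds.append("file")
    return kinds
```

Writing 3 therefore asks for both kinds of charges to be moved, while 0 disables the feature.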


8.3 TODO

All of moving charge operations are done under cgroup_mutex. It’s not good behavior to hold the mutex too long, so we may need some trick.

9. Memory thresholds

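The memory cgroup implements memory thresholds on top of the cgroup notification API, which is rather neat. In short, we can set a threshold on a cgroup and, when usage crosses it, receive a message through an eventfd.

To register a threshold, an application must:

Create an eventfd using eventfd(2)
Open memory.usage_in_bytes or memory.memsw.usage_in_bytes and obtain its file descriptor
Write a string of the form "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control

The application is notified through the eventfd when memory usage crosses the threshold in either direction.

This works for both root and non-root cgroups.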
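The three registration steps can be sketched in Python as follows. This is a minimal sketch that assumes Python 3.10 or later on Linux (for os.eventfd and os.eventfd_read) and a mounted cgroup v1 memory hierarchy; the function names are made up for this example and error handling is kept to a minimum.

```python
import os

def event_control_line(event_fd, target_fd, *args):
    """Build the string written to cgroup.event_control."""
    return " ".join(str(part) for part in (event_fd, target_fd, *args))

def wait_for_threshold(cgroup_dir, threshold_bytes):
    """Block until memory usage of cgroup_dir crosses threshold_bytes."""
    efd = os.eventfd(0)                       # step 1: create the eventfd
    usage_fd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"),
                       os.O_RDONLY)           # step 2: open the watched file
    try:
        control = os.open(os.path.join(cgroup_dir, "cgroup.event_control"),
                          os.O_WRONLY)
        try:                                  # step 3: register the threshold
            line = event_control_line(efd, usage_fd, threshold_bytes)
            os.write(control, line.encode())
        finally:
            os.close(control)
        return os.eventfd_read(efd)           # blocks until the kernel signals
    finally:
        os.close(usage_fd)
        os.close(efd)
```

The same helper covers the other event types described later in this document: for memory.oom_control the line has no third field, and for memory.pressure_level the third field is the level with an optional mode.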

10. OOM Control

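The memory.oom_control file is used for OOM notification and other controls.

In the same way, the memory cgroup implements an OOM notifier using the cgroup notification API. It allows multiple OOM notification deliveries to be registered, and notifies them when an OOM happens.

To register a notifier, an application must:

Create an eventfd using eventfd(2)
Open memory.oom_control and obtain its file descriptor
Write a string of the form "<event_fd> <fd of memory.oom_control>" to cgroup.event_control

When an OOM happens, the eventfd is signalled and the application is notified. OOM notification doesn't work for the root cgroup.

The OOM-killer can be disabled by writing "1" to memory.oom_control:

# echo 1 > memory.oom_control

If the OOM-killer is disabled, a task that goes over the memory limit when no memory is available is put to sleep on the memory cgroup's OOM-waitqueue. To wake it up, either enlarge the limit or reduce the usage.

Note that reducing usage here applies when memory reclaim can do nothing more; the options are: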

kill some tasks.
move some tasks to other group with account migration.
remove some files (on tmpfs?)

Reading memory.oom_control shows the following information:

oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.)
oom_kill integer counter: the number of processes belonging to this cgroup killed by any kind of OOM killer.

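11. Memory Pressure

The pressure level notifications can be used to monitor the memory allocation cost; they are also used together with an eventfd. Based on the pressure, applications can implement different strategies for managing their memory resources. The pressure levels are defined as follows:

low
The "low" level means that the system is reclaiming memory for new allocations. Monitoring this reclaiming activity might be useful for maintaining cache level. Upon notification, the program (typically "Activity Manager") might analyze vmstat and act in advance (i.e. prematurely shutdown unimportant services).
medium
The "medium" level means that the system is experiencing medium memory pressure; it might be swapping, paging out active file caches, etc. Upon this event applications may decide to further analyze vmstat/zoneinfo/memcg or internal memory usage statistics and free any resources that can be easily reconstructed or re-read from a disk.
critical
The "critical" level means that the system is close to collapse: it is about to run out of memory (OOM), or the OOM killer may already have been triggered. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action.

By default, an event propagates upward until it is caught and handled by a listener. For example, take three cgroups A->B->C. If listeners are set up on A, B and C and C experiences pressure, only C receives the event; if only A and B have listeners, B receives the pressure event. This avoids excessive "broadcasting" of messages, which disturbs the system and is especially bad when we are low on memory or thrashing.

Users can also choose how events propagate:

default
The default behavior described above
hierarchy
Pressure events always propagate up to the root, regardless of whether the cgroups in between have listeners. In the example above, A, B and C all receive the pressure notification
local
The listener only receives pressure events of the cgroup it is registered on

As I understand it, this propagation setting is a property of both the event and the listener. The default is:

The event propagates upward and stops at the first listener it meets; this is the event property
A listener handles any event it receives; this is the listener property

hierarchy changes the first, so that the event always keeps propagating upward; local changes the second, so that the listener only hears pressure events raised by the cgroup it is registered on.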
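The three modes can be checked against each other with a small simulation. This sketch assumes at most one listener per cgroup and ignores pressure levels; notified_groups is a made-up helper that only models the propagation rules described above, not the kernel implementation.

```python
def notified_groups(path, listeners):
    """Return the cgroups whose listeners fire for pressure in path[0].

    path lists cgroups from the one under pressure up to the root;
    listeners maps a cgroup name to "default", "hierarchy" or "local".
    """
    notified = []
    handled = False
    for depth, group in enumerate(path):
        mode = listeners.get(group)
        if mode is None:
            continue              # no listener registered on this cgroup
        if mode == "local" and depth > 0:
            continue              # local listeners ignore descendants' events
        if mode == "default" and handled:
            continue              # a listener further down already took it
        notified.append(group)
        handled = True
    return notified
```

For the chain A->B->C with pressure in C, the path is ["C", "B", "A"]: default listeners on all three yield ["C"], default listeners on A and B only yield ["B"], and hierarchy listeners on all three yield ["C", "B", "A"].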

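To use this feature, an application must:

Create an eventfd using eventfd(2)
Open memory.pressure_level and obtain its file descriptor
Write a string of the form "<event_fd> <fd of memory.pressure_level> <level[,mode]>" to cgroup.event_control

The application is notified through the eventfd when memory pressure reaches the specified level (or higher). Read/write operations on memory.pressure_level are not implemented.

Test:
Here is a small script example that creates a new cgroup, sets up a memory limit, sets up a notification in the cgroup and then makes the child cgroup experience critical pressure: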

# cd /sys/fs/cgroup/memory/
# mkdir foo
# cd foo
# cgroup_event_listener memory.pressure_level low,hierarchy &
# echo 8000000 > memory.limit_in_bytes
# echo 8000000 > memory.memsw.limit_in_bytes
# echo $$ > tasks
# dd if=/dev/zero | read x

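(Expect a bunch of notifications, and eventually, the oom-killer will trigger.)

cgroup_event_listener is a small tool from the Linux kernel source tree; source: linux/cgroup_event_listener.c

However, when I actually ran it, I did not see any effect...

Last edited: 2021.09.02 20:54:05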
