2016/01/25

The memory allocation process

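A user-space program requests memory by calling malloc. malloc is a library function implemented by glibc. glibc contains its own memory manager, which manages allocation within the process address space at the user-space level.
Depending on the situation, malloc asks the kernel for a memory region through the brk or mmap system call (by default, glibc uses mmap for requests of 128 KB or more and extends the heap with brk for smaller ones).
When an mmap request succeeds, the kernel creates a VMA (virtual memory area) node for the new region. The node is stored in the mmap field of the process memory descriptor, struct mm_struct.
As is well known, Linux relies on the protected mode (virtual address mode) of Intel CPUs and manages memory with page tables. Each process has its own page table entry point, which is also recorded in the memory descriptor, in the pgd field.
For a newly allocated region, the kernel does not allocate physical memory right away. Linux follows an "allocate on first use" policy: when the process first accesses the region, a page fault is raised and traps into the kernel, and only then does the kernel allocate physical memory for the region. Physical memory is allocated in units of pages. All free pages are organized by the buddy algorithm (its state is what /proc/buddyinfo reports) into several queues, and every block in a queue has the same size. With a 4 KB page size, blocks in the first queue are 2^0 * pagesize, that is 4 KB; blocks in the second queue are 2^1 * pagesize, that is 8 KB; blocks in the last queue are 2^10 * pagesize, that is 4 MB. The buddy allocator hands out pages from the best-fitting queue. So, through the page fault, the process asks the kernel to obtain pages from the buddy allocator. Once the allocation succeeds, the mapping between the new region and its pages is written into the process page table.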
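The queue sizes described above follow directly from the order of each queue. The sketch below only illustrates that arithmetic; the function names and the BUDDY_MAX_ORDER constant are made up for this example and are not kernel interfaces, and the 4 KB page size is the one assumed in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Highest order described in the text (blocks of 2^10 pages). Illustrative constant. */
#define BUDDY_MAX_ORDER 10

/* Size in bytes of one block in the queue of the given order. */
size_t buddy_block_bytes(unsigned order, size_t page_size)
{
    assert(order <= BUDDY_MAX_ORDER);
    return page_size << order;
}

/* Smallest order whose block can hold `bytes`, or -1 if no queue is large enough. */
int buddy_order_for(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    for (unsigned order = 0; order <= BUDDY_MAX_ORDER; order++) {
        if (buddy_block_bytes(order, page_size) >= bytes)
            return (int)order;
    }
    return -1;
}
```

With a 4096-byte page, buddy_block_bytes(10, 4096) is 4194304 (4 MB), and a request of 8192 bytes maps to order 1.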
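Note: because the process address space consists of virtual addresses, the buddy allocator usually serves a process from the ORDER=0 queue (2^0 * pagesize). Even when the process needs several pages, the buddy allocator simply takes that many single pages from the ORDER=0 queue.
At this point the virtual address space requested by the user-space process is bound to physical memory, and subsequent code can access it.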
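The "allocate on first use" behaviour can be observed from user space. The following is a minimal sketch, assuming a Linux system: it maps a fresh anonymous region with mmap, touches one byte per page, and reports how many minor page faults getrusage recorded in between. The function name is made up for this example. On a typical system the result is roughly one fault per page, but the exact number is not guaranteed (transparent huge pages, for instance, can change it).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults observed while first touching `pages` fresh pages, or -1 on error. */
long minor_faults_on_first_touch(size_t pages)
{
    assert(pages > 0);
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    size_t len = pages * (size_t)page_size;
    struct rusage before, after;

    /* The kernel only creates the mapping here; no physical pages are assigned yet. */
    void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap(region, len);
        return -1;
    }

    /* Write one byte per page so that each page has to be faulted in. */
    volatile char *bytes = region;
    for (size_t i = 0; i < pages; i++)
        bytes[i * (size_t)page_size] = 1;

    if (getrusage(RUSAGE_SELF, &after) != 0) {
        munmap(region, len);
        return -1;
    }

    munmap(region, len);
    return after.ru_minflt - before.ru_minflt;
}
```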
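User-space processes are not the only ones that request memory: allocations made by the kernel itself are also served by the buddy allocator. The kernel has two main allocation interfaces, kmalloc and vmalloc. Both return virtual addresses, but kmalloc requires physically contiguous memory while vmalloc does not. kmalloc is meant for small allocations and is built on top of slab; vmalloc has no such restriction.
The figure below shows how the terms mentioned above relate to one another (malloc, glibc, VMA, mm_struct, pgd, kmalloc, slab, vmalloc, buddyinfo):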

< ---------------------------------- >

Memory fragmentation

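Although memory is allocated at many layers, only the glibc library and the buddy allocator actually track how memory is used and manage it. Because memory is handed out in blocks of varying sizes, and allocated regions can be freed and reallocated, fragmentation is unavoidable. Fragmentation in the buddy allocator lowers system performance and may even interfere with normal operation.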

How memory fragmentation affects performance

In its definition of memory fragmentation, Wikipedia gives an example that illustrates the performance impact:
A subtler problem is that fragmentation may prematurely exhaust a cache, causing thrashing, due to caches holding blocks, not individual data. For example, suppose a program has a working set of 256 KiB, and is running on a computer with a 256 KiB cache (say L2 instruction+data cache), so the entire working set fits in cache and thus executes quickly, at least in terms of cache hits. Suppose further that it has 64 translation lookaside buffer (TLB) entries, each for a 4 KiB page: each memory access requires a virtual-to-physical translation, which is fast if the page is in cache (here TLB). If the working set is unfragmented, then it will fit onto exactly 64 pages (the page working set will be 64 pages), and all memory lookups can be served from cache. However, if the working set is fragmented, then it will not fit into 64 pages, and execution will slow due to thrashing: pages will be repeatedly added and removed from the TLB during operation. Thus cache sizing in system design must include margin to account for fragmentation.
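To follow this example, one needs to understand how the MMU, the TLB, and the page tables relate.
As mentioned earlier, Linux relies on the protected mode of Intel CPUs and manages memory with page tables. Each virtual linear address belongs to some page, and that mapping lives in the page table. Because almost every access to a page in virtual memory must first resolve the page to obtain the corresponding physical address, the performance of page table operations is critical. The Intel MMU architecture therefore implements a TLB (translation lookaside buffer), a hardware cache that maps virtual addresses to physical addresses. When a virtual address is accessed, the processor first checks whether the TLB holds its mapping; on a hit the physical address is returned directly, otherwise it has to be found by walking the page table.
The TLB is small: 64 entries in the example above. Once memory is fragmented, the virtual address space of a process maps to a large number of scattered pages, and the TLB cannot hold that many page table entries. Throughout the lifetime of the process, the MMU then keeps missing in the TLB during address translation and has to refill it again and again, which greatly reduces execution efficiency.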
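The effect of scattering on the page working set can be made concrete with a small calculation. The sketch below counts how many distinct pages a sequence of equally sized objects touches when they are laid out at a fixed stride. The function name and the 64-byte object size used in the note are assumptions chosen to match the 256 KiB working set of the quoted example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of distinct pages touched by `n` objects of `obj_size` bytes
 * placed at offsets 0, stride, 2 * stride, ... (offsets are increasing).
 */
size_t count_pages_touched(size_t n, size_t obj_size, size_t stride, size_t page_size)
{
    assert(obj_size > 0 && page_size > 0);
    size_t pages = 0;
    size_t next_uncounted = 0; /* lowest page index that has not been counted yet */

    for (size_t i = 0; i < n; i++) {
        size_t first = (i * stride) / page_size;
        size_t last = (i * stride + obj_size - 1) / page_size;

        if (first < next_uncounted)
            first = next_uncounted; /* skip pages already counted for earlier objects */
        if (last >= first) {
            pages += last - first + 1;
            next_uncounted = last + 1;
        }
    }
    return pages;
}
```

4096 objects of 64 bytes packed back to back (stride 64) occupy 256 KiB and touch 64 pages, which fits a 64-entry TLB. The same 4096 objects placed one per 4 KiB page (stride 4096) touch 4096 pages, far more than the TLB can hold.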
This performance problem is very subtle and cannot be spotted from one or a few obvious runtime metrics. To verify it, the following program was used.

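#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define KB (1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

int main(int argc, char *argv[])
{
    char *p;
    int i = 0;

    /* Keep allocating 70 KB blocks; they are intentionally never freed. */
    while ((p = (char *)malloc(70 * KB)))
    {
        /* Write to every byte so that each page is faulted in and backed by physical memory. */
        memset(p, 0, 70 * KB);

        /* Stop after a fixed number of blocks. */
        if (i > 500000)
            break;

        i++;
    }

    sleep(1);

    return 0;
}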

a.png

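The difference is clear: on LAB 112, dTLB-load-misses is far higher than on InternalBeta 132, a benchmark machine of the same class: 142,125,471 versus 3,799,475. The run time is also 5 seconds longer. The program does nothing CPU-intensive, yet it runs 33% slower here, which shows the performance degradation on LAB112.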

How memory fragmentation affects kmalloc

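Because kmalloc requires physically contiguous memory, a kmalloc request that cannot be satisfied fails when the buddy allocator has no large blocks left. A kmalloc failure then shows up in the kernel log.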
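The point is that plenty of free pages does not imply that a contiguous block is available. The sketch below models the per-order free counts as a plain array (the same shape as one row of /proc/buddyinfo) and checks whether a block of a given order could be served. It is a toy model with made-up function names, not the kernel's allocation path; it only relies on the fact that a larger free block can be split to serve a smaller request.

```c
#include <assert.h>
#include <stddef.h>

/* free_blocks[o] is the number of free blocks of 2^o pages. */
unsigned long total_free_pages(const unsigned long free_blocks[], int orders)
{
    assert(free_blocks != NULL);
    unsigned long pages = 0;
    for (int o = 0; o < orders; o++)
        pages += free_blocks[o] << o;
    return pages;
}

/* 1 if some free block of at least `wanted` order exists, 0 otherwise. */
int has_block_of_order(const unsigned long free_blocks[], int orders, int wanted)
{
    assert(free_blocks != NULL && wanted >= 0);
    for (int o = wanted; o < orders; o++) {
        if (free_blocks[o] > 0)
            return 1; /* a larger block can be split to serve the request */
    }
    return 0;
}
```

A zone with 1000 free order-0 blocks and nothing else has 1000 free pages, yet has_block_of_order reports 0 for order 3 (eight contiguous pages). A zone with 125 free order-3 blocks has the same 1000 free pages and can serve the request.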

Operating system recovery mechanisms: reclaim and compaction

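Q: How can one tell that the system is fragmented?
A: cat /proc/buddyinfo
Q: How can dTLB-load-misses be inspected?
A: perf stat -e dTLB-load,dTLB-load-misses ./brush_mem; perf top shows the busiest functions on the system at the moment
Q: How can page faults be inspected?
A: ps -o majflt,minflt -C prog
Q: How does a virtual linear address map to a physical address?
A: The page table holds the starting physical address of each page. Looking up the page for a virtual address in the page table and adding the offset within the page yields the final physical address.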
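The last answer can be sketched in a few lines. The example below uses a deliberately simplified, single-level page table stored as an array of page start addresses; real x86 page tables are multi-level, and every name here (TOY_PAGE_SHIFT, toy_translate, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                      /* 4 KB pages */
#define TOY_PAGE_SIZE (1u << TOY_PAGE_SHIFT)
#define TOY_NOT_MAPPED UINT64_MAX

/*
 * frame_base[vpn] is the starting physical address of the page that backs
 * virtual page number `vpn`. Returns the physical address for `vaddr`,
 * or TOY_NOT_MAPPED if the page table has no entry for it.
 */
uint64_t toy_translate(const uint64_t frame_base[], size_t n_pages, uint64_t vaddr)
{
    assert(frame_base != NULL);
    uint64_t vpn = vaddr >> TOY_PAGE_SHIFT;          /* which page */
    uint64_t offset = vaddr & (TOY_PAGE_SIZE - 1);   /* position inside the page */

    if (vpn >= n_pages)
        return TOY_NOT_MAPPED;
    return frame_base[vpn] + offset;
}
```

With a table that maps virtual page 0 to 0x5000 and virtual page 1 to 0x9000, the virtual address 0x1234 translates to 0x9234: page 1, offset 0x234.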

Appendix 1: The relationship between the MMU, the TLB, and the cache

Detecting memory fragmentation
Locating the cause of memory fragmentation
ftrace, systemtap,

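Internal fragmentation / external fragmentation: pooling cannot remove the performance impact of external fragmentation, because the memory handed out by a pool may still be backed by scattered pages. What if a user-space process insists on physically contiguous memory?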

overcommit : Link
Definition of memory fragmentation: the Wikipedia definition

How memory fragmentation affects performance

Memory fragmentation is one of the most severe problems faced by system managers.[citation needed] Over time, it leads to degradation of system performance. Eventually, memory fragmentation may lead to complete loss of (application-usable) free memory.
A case of a system going into thrashing: Link
Inspecting memory fragmentation: cat /proc/buddyinfo -> buddy system: what the buddy system is
cat /proc/buddyinfo
External fragmentation is a problem under some workloads, and buddyinfo is a useful tool for helping diagnose these problems. Buddyinfo will give you a clue as to how big an area you can safely allocate, or why a previous allocation failed.
Each column represents the number of pages of a certain order which are available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE available in ZONE_NORMAL, etc...
More information relevant to external fragmentation can be found in pagetypeinfo.
More detail about buddyinfo: Link Link
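To work with these columns programmatically, each line of /proc/buddyinfo can be split into its zone name and per-order counts. The following is a minimal sketch, assuming the usual "Node N, zone NAME count count ..." layout; the function name is made up, and the sample line in the note is illustrative rather than taken from a real machine.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parses one /proc/buddyinfo line. Stores the zone name in `zone` and the
 * free-block count of each order in `counts` (counts[0] is order 0).
 * Returns the number of orders parsed, or -1 if the line header is not recognized.
 */
int parse_buddyinfo_line(const char *line, char zone[32],
                         unsigned long counts[], int max_orders)
{
    assert(line != NULL && zone != NULL && counts != NULL);
    int node = 0;
    int consumed = 0;

    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return -1;

    const char *p = line + consumed;
    int n = 0;
    while (n < max_orders) {
        char *end;
        unsigned long value = strtoul(p, &end, 10);
        if (end == p)
            break; /* no more numbers on this line */
        counts[n++] = value;
        p = end;
    }
    return n;
}
```

For a line such as "Node 0, zone   Normal      1      0      0      1    101      8", the function returns 6 and counts[4] is 101, meaning 101 free blocks of 2^4 pages. In practice the lines would come from reading /proc/buddyinfo with fgets.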
cat /proc/pagetypeinfo
Fragmentation avoidance in the kernel works by grouping pages of different migrate types into the same contiguous regions of memory called page blocks. A page block is typically the size of the default hugepage size e.g. 2MB on X86-64. By keeping pages grouped based on their ability to move, the kernel can reclaim pages within a page block to satisfy a high-order allocation. The pagetypeinfo begins with information on the size of a page block. It then gives the same type of information as buddyinfo except broken down by migrate-type and finishes with details on how many page blocks of each type exist.
If min_free_kbytes has been tuned correctly (recommendations made by hugeadm from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can make an estimate of the likely number of huge pages that can be allocated at a given point in time. All the "Movable" blocks should be allocatable unless memory has been mlock()'d. Some of the Reclaimable blocks should also be allocatable although a lot of filesystem metadata may have to be reclaimed to achieve this.
Official documentation of the Linux proc filesystem

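Linux tool to dump x86 CPUID information about the CPU: used to inspect CPU information, including the L1 and L2 caches.
The perf tool can provide a lot of low-level data. Its detailed wiki is at Link, and usage instructions are at Link. perf stat -e dTLB-load,dTLB-load-misses ./brush_mem; perf top shows the busiest functions on the system at the moment.
The ps tool shows page fault information for a process. ps can display many other kinds of information as well.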
ps -o majflt,minflt -C mailto
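majflt and minflt differ as follows:
majflt counts major faults, which require disk I/O: either the page has to be loaded from disk into physical memory, or physical memory is short and some pages must first be evicted to disk. minflt counts minor faults, which are resolved without disk I/O, for example on the first access to newly allocated memory.
Note that page faults have nothing to do with memory fragmentation. Because Linux allocates lazily, accessing newly allocated space triggers a page fault.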
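The relationship between the TLB and the cache: Link. It also gives a command, dmidecode -t cache, which lists the CPU caches (for example L1 and L2) recorded in the system firmware tables.
The relationship between the MMU, the cache, and the TLB: Link, Link2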

b.png

An article with pictures that clearly explains how virtual memory, physical memory, buddy, and slab relate: Link

c.PNG

d.png

e.png

TLB data: how should DTLB Load Misses Retired be interpreted? Link, Chinese Link;
Thread Specificity: TS
The number of retired load instructions that experienced data translation lookaside buffer (DTLB) misses. When a 32-bit linear data address is submitted by the processor, it is first submitted to the DTLB. The translation lookaside buffer (TLB) translates the 32-bit linear address produced by the load unit into a 36-bit physical memory address before the cache lookup is performed. DTLB size and organization are processor design-specific. A DTLB miss requires memory accesses to the OS page directory and tables in order to translate the 32-bit linear address.
A DTLB miss does not necessarily indicate a cache miss.
CAUTION:
Extra memory accesses could impact CPI.
TIP
To minimize DTLB misses, minimize the size of the data and locality such that:
• data spans a minimum number of pages
• the number of pages the data spans is less than the number of DTLB entries
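Differences between kmalloc, vmalloc, and malloc: Link
• kmalloc and vmalloc allocate kernel memory, while malloc allocates user memory
• kmalloc guarantees that the allocated memory is physically contiguous; vmalloc guarantees contiguity only in the virtual address space; malloc likewise returns memory that is contiguous only in the virtual address space and makes no promise about physical contiguity
• kmalloc is limited in how much it can allocate, while vmalloc and malloc can allocate comparatively large sizes
• Memory only needs to be physically contiguous when it is going to be accessed by DMA
• vmalloc is slower than kmalloc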
How malloc works underneath: Link. The M_TRIM_THRESHOLD option: default value for this parameter is 128*1024; Link
The overcommit mechanism of Linux memory allocation --> two fields on atop's SWP line reflect it ->
This line contains the total amount of swap space on disk ('tot') and the amount of free swap space ('free').
Furthermore the committed virtual memory space ('vmcom') and the maximum limit of the committed space ('vmlim', which is by default swap size plus 50% of memory size) is shown. The committed space is the reserved virtual space for all allocations of private memory space for processes. The kernel only verifies whether the committed space exceeds the limit if strict overcommit handling is configured (vm.overcommit_memory is 2).
cat /proc/sys/vm/overcommit_memory
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
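The three modes listed above can be read and decoded from a program. The following is a small sketch, assuming the file holds a single integer as shown; the function names are made up for this example, and the descriptions are the ones quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Human-readable description of a vm.overcommit_memory value. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0: return "heuristic overcommit";
    case 1: return "always overcommit, never check";
    case 2: return "always check, never overcommit";
    default: return "unknown";
    }
}

/* Reads the mode from `path` (normally /proc/sys/vm/overcommit_memory); -1 on failure. */
int read_overcommit_mode(const char *path)
{
    assert(path != NULL);
    int mode = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

A caller would typically print overcommit_mode_name(read_overcommit_mode("/proc/sys/vm/overcommit_memory")); an unreadable file yields -1 and therefore "unknown".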

Reclaim: Link1 Link2
Zone_reclaim_mode allows someone to set more or less aggressive approaches to reclaim memory when a zone runs out of memory. If it is set to zero then no zone reclaim occurs. Allocations will be satisfied from other zones / nodes in the system.
This is value ORed together of

1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
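Because the value is a bitwise OR of the three flags, a setting such as 5 means "zone reclaim on" plus "zone reclaim swaps pages". The sketch below decodes such a value; the enum and function names are invented for this example and are not the kernel's own identifiers.

```c
#include <assert.h>

/* Bit values as listed above; the names are illustrative. */
enum {
    ZR_RECLAIM_ON = 1,    /* zone reclaim on */
    ZR_WRITE_DIRTY = 2,   /* zone reclaim writes dirty pages out */
    ZR_SWAP_PAGES = 4     /* zone reclaim swaps pages */
};

/* 1 if `flag` is set in a zone_reclaim_mode value, 0 otherwise. */
int zone_reclaim_has(int mode, int flag)
{
    assert(mode >= 0 && mode <= 7);
    return (mode & flag) != 0;
}
```

For mode 5, zone_reclaim_has reports ZR_RECLAIM_ON and ZR_SWAP_PAGES as set and ZR_WRITE_DIRTY as clear.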

zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.
zone_reclaim may be enabled if it's known that the workload is partitioned such that each partition fits within a NUMA node and that accessing remote memory would cause a measurable performance reduction. The page allocator will then reclaim easily reusable pages (those page cache pages that are currently not used) before allocating off node pages.
Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. Zone reclaim will write out dirty pages if a zone fills up and so effectively throttle the process. This may decrease the performance of a single process since it cannot use all of system memory to buffer the outgoing writes anymore but it preserve the memory on other nodes so that the performance of other processes running on other nodes will not be affected.
Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations.

cgroup system resource management: Link

sar??
curl - memory pool ?

/proc/sys/vm/zone_reclaim_mode

f.png

TLB (Translation Lookaside Buffer)

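This cache holds physical addresses in order to speed up the translation of linear addresses. When a linear address is used for the first time, the corresponding physical address is computed through the page directory and page table and is then cached in the TLB, so that later references to the same linear address can obtain the physical address directly from the TLB. Note also that when the CR3 control register is updated, the hardware automatically invalidates the TLB entries (apart from entries marked global): after the change CR3 holds a new page directory base address, so the old entries must no longer be used for linear address translation.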
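The caching and flushing behaviour described here can be imitated with a toy model. The sketch below is a direct-mapped, 64-entry software "TLB" that counts hits and misses and can be flushed the way a CR3 reload would. Real TLBs are organized differently (associativity, multiple levels, global entries), and every name in the example, including the stand-in page walk and the mapping it returns, is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_TLB_ENTRIES 64

struct toy_tlb {
    uint64_t vpn[TOY_TLB_ENTRIES];        /* cached virtual page numbers */
    uint64_t pfn[TOY_TLB_ENTRIES];        /* matching physical frame numbers */
    unsigned char valid[TOY_TLB_ENTRIES];
    unsigned long hits;
    unsigned long misses;
};

/* Stand-in for a page table walk; the mapping itself is made up. */
static uint64_t toy_page_walk(uint64_t vpn)
{
    return vpn + 0x1000;
}

/* Invalidates every entry, as a CR3 reload does in this simplified model. */
void toy_tlb_flush(struct toy_tlb *t)
{
    assert(t != NULL);
    memset(t->valid, 0, sizeof t->valid);
}

/* Returns the frame for `vpn`, filling the TLB slot on a miss. */
uint64_t toy_tlb_lookup(struct toy_tlb *t, uint64_t vpn)
{
    assert(t != NULL);
    unsigned slot = (unsigned)(vpn % TOY_TLB_ENTRIES);

    if (t->valid[slot] && t->vpn[slot] == vpn) {
        t->hits++;
        return t->pfn[slot];
    }

    t->misses++;
    t->vpn[slot] = vpn;
    t->pfn[slot] = toy_page_walk(vpn);
    t->valid[slot] = 1;
    return t->pfn[slot];
}
```

Starting from a zeroed struct, the first lookup of a page is a miss, a second lookup of the same page is a hit, and after toy_tlb_flush the same page misses again.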

Paging under Physical Address Extension (PAE)

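Physical address extension was introduced mainly because 4 GB of physical memory had become a serious bottleneck for large servers running thousands of processes, which pushed Intel to extend the physical memory of x86. By increasing the number of address pins on the processor from 32 to 36, Intel raised the addressing capability to 2^36 = 64 GB, enough to meet the needs of the high-end market; this in turn led to a different interpretation of paged addressing. In practice even the low-end PC market is now gradually moving from 32-bit to 64-bit, so this mechanism is only introduced briefly here.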
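The figures in this paragraph are plain powers of two, which a one-line helper can confirm. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes addressable with the given number of address bits. */
uint64_t addressable_bytes(unsigned address_bits)
{
    assert(address_bits < 64);
    return (uint64_t)1 << address_bits;
}
```

addressable_bytes(32) is 4 GB (4 * 2^30 bytes) and addressable_bytes(36) is 64 GB, a factor of 2^4 = 16 more.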
http://www.kerneltravel.net/journal/v/mem.htm
http://coolshell.cn/articles/7490.html

2018-05-13 (old post, tidied up) An overview of memory fragmentation