Memory Barriers/Fences: Java Memory Barriers [Translation]

Original article: http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html

In this article I’ll discuss the most fundamental technique in concurrent programming: memory barriers, or fences, which make the memory state within a processor visible to other processors.

CPUs have employed many techniques to accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance. In my “Write Combining” article I touched on just one of these techniques. The most common technique CPUs use to hide memory latency is to pipeline instructions and then spend significant effort, and resources, on re-ordering these pipelines to minimise stalls related to cache misses.

When a program is executed it does not matter if its instructions are re-ordered, provided the same end result is achieved. For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it. The compiler and CPU are free to re-order the instructions to best utilise the CPU, provided the counter is updated by the time the next iteration is about to commence. Also, over the execution of a loop this variable may be kept in a register and never pushed out to cache or main memory, so it is never visible to another CPU.
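The register-caching effect described above is usually met in Java as a loop that never notices a flag changed by another thread. The following is a minimal sketch, not taken from the original article: the class name `StopFlagDemo` and its method are hypothetical, and it assumes a Java 8 or later runtime. Declaring the flag `volatile` obliges the loop to re-read it on each iteration, so the worker sees the update.

```java
import java.util.concurrent.TimeUnit;

final class StopFlagDemo {
    // volatile: every loop iteration must re-read the field rather than
    // reuse a copy that was hoisted into a register.
    private static volatile boolean running = true;

    /** Starts a spinning worker, clears the flag, and reports whether the worker stopped. */
    static boolean workerObservesStop() {
        running = true;
        Thread worker = new Thread(() -> {
            long spins = 0;
            while (running) {   // volatile read on each iteration
                spins++;
            }
        });
        worker.setDaemon(true); // never keep the JVM alive if something goes wrong
        worker.start();
        try {
            TimeUnit.MILLISECONDS.sleep(50);
            running = false;    // volatile write, becomes visible to the worker
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerObservesStop());
    }
}
```

Without `volatile`, the JIT compiler is allowed to hoist the read out of the loop, and the worker may then spin forever. Whether that actually happens depends on the JVM and on how the code is compiled.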

CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units, each of which can do some combination of arithmetic, conditional logic, and memory manipulation. These execution units operate in parallel, allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order as observed from another CPU.

Finally, when a cache miss occurs, a modern CPU can make an assumption about the result of a memory load and continue executing based on this assumption until the load returns the actual data.

Provided “program order” is preserved, the CPU and compiler are free to do whatever they see fit to improve performance.

Figure 1

Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system.

In a multi-threaded environment, techniques need to be employed to make program results visible in a timely manner. I will not cover cache coherence in this article. Just assume that once memory has been pushed to the cache, a protocol of messages will occur to ensure all caches are coherent for any shared data. The techniques for making memory visible from a processor core are known as memory barriers or fences.

Memory barriers provide two properties. Firstly, they preserve externally visible program order by ensuring all instructions on either side of the barrier appear in the correct program order if observed from another CPU. Secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system.

Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is the relatively strong memory model on Intel CPUs, which is simpler than, say, the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming, I’ll try to simplify to this level.

Store Barrier
A store barrier, the “sfence” instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and has the store buffers flushed to cache for the CPU on which it is issued. This makes the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated, other consumers and producers know how far this consumer has progressed and can take appropriate action. All updates to memory that happened before the barrier are now visible.


// Progress of this consumer; volatile so that each write is followed by a store barrier.
private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;
// from inside the run() method
T event = null;
long nextSequence = sequence + 1L;
while (running)
{
    try
    {
        // wait until the barrier reports how far it is safe to read
        final long availableSequence = barrier.waitFor(nextSequence);
        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }
        // publish progress once per batch, not once per event
        sequence = nextSequence - 1L;
        // store barrier inserted here !!!
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);
        // mark the failed event as processed and move on
        sequence = nextSequence;
        // store barrier inserted here !!!
        nextSequence++;
    }
}

Load Barrier

A load barrier, the “lfence” instruction on x86, forces all load instructions after the barrier to happen after the barrier and then waits on the load buffer to drain for that CPU. This makes program state exposed by other CPUs visible to this CPU before it makes further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor.
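The pairing of a store barrier on the writer with a load barrier on the reader can be sketched in plain Java. This is an illustration under assumptions, not Disruptor code: the class `SequencePublishDemo`, its fields, and its methods are hypothetical names, and `Thread.onSpinWait()` requires Java 9 or later. The writer fills ordinary array slots and then performs one volatile write; the reader performs a volatile read and only then touches the slots.

```java
final class SequencePublishDemo {
    private static final int SLOTS = 8;
    private final long[] slots = new long[SLOTS]; // plain, non-volatile data
    private volatile long sequence = -1L;         // highest slot index that is safe to read

    /** Writer: plain stores to the slots, then a single volatile store to publish them. */
    void publishAll() {
        for (int i = 0; i < SLOTS; i++) {
            slots[i] = i * 10L;
        }
        sequence = SLOTS - 1; // the earlier plain stores are visible before this write
    }

    /** Reader: spin on the volatile read, then read only the slots it covers. */
    long awaitAndSum() {
        long available;
        while ((available = sequence) < SLOTS - 1) {
            Thread.onSpinWait();
        }
        long sum = 0;
        for (int i = 0; i <= available; i++) {
            sum += slots[i];
        }
        return sum;
    }

    /** Runs one writer/reader exchange and returns the sum the reader computed. */
    static long run() {
        SequencePublishDemo demo = new SequencePublishDemo();
        long[] result = new long[1];
        Thread reader = new Thread(() -> result[0] = demo.awaitAndSum());
        reader.start();
        demo.publishAll();
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("sum seen by reader: " + run());
    }
}
```

Because the reader only reads slots after it has observed the published sequence, it sees every slot value the writer stored before the volatile write.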

Full Barrier

A full barrier, the “mfence” instruction on x86, is a composite of both load and store barriers happening on a CPU.
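Since Java 9, explicit fences are available as static methods on `java.lang.invoke.VarHandle`, including `VarHandle.fullFence()`. The sketch below is a classic store-then-load litmus test and is only an illustration: the class `FullFenceLitmus` and its members are hypothetical names. Each thread stores to its own field, issues a full fence, and then loads the other thread's field. With the fence in place, the outcome in which both threads read 0 should not occur.

```java
import java.lang.invoke.VarHandle;

final class FullFenceLitmus {
    private int x, y;   // plain shared fields, both start at 0
    private int r1, r2; // what each thread observed

    /** Runs one store-then-load race and reports whether both threads read 0. */
    private boolean bothSawZero() {
        Thread a = new Thread(() -> { x = 1; VarHandle.fullFence(); r1 = y; });
        Thread b = new Thread(() -> { y = 1; VarHandle.fullFence(); r2 = x; });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return r1 == 0 && r2 == 0;
    }

    /** Repeats the race on fresh state and counts the outcomes the fence should rule out. */
    static int countForbiddenOutcomes(int iterations) {
        int forbidden = 0;
        for (int i = 0; i < iterations; i++) {
            if (new FullFenceLitmus().bothSawZero()) {
                forbidden++;
            }
        }
        return forbidden;
    }

    public static void main(String[] args) {
        System.out.println("forbidden outcomes: " + countForbiddenOutcomes(500));
    }
}
```

If the two `fullFence()` calls are removed, the store can sit in the store buffer while the following load executes, so the "both read 0" outcome becomes possible even on x86. How often it shows up in practice varies by machine and JVM.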

Java Memory Model

In the Java Memory Model a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it. Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes and a reference to the object is available.

Atomic Instructions and Software Locks

Atomic instructions, such as the “lock …” instructions on x86, are effectively a full barrier, as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order.
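In Java, atomic instructions are reached through the `java.util.concurrent.atomic` classes. The following is a small sketch with hypothetical names (`AtomicCounterDemo`, `countWith`). Several threads increment a shared `AtomicLong`; on x86 a JIT compiler will typically turn `incrementAndGet()` into a lock-prefixed instruction, although the exact instruction is an implementation detail.

```java
import java.util.concurrent.atomic.AtomicLong;

final class AtomicCounterDemo {
    /** Increments one shared counter from several threads and returns the final value. */
    static long countWith(int threads, int incrementsPerThread) {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("final count: " + countWith(4, 10_000));
    }
}
```

No increment is lost because each update is an indivisible read-modify-write. A plain `long` incremented with `++` from several threads would usually end up below the expected total.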

Performance Impact of Memory Barriers

Memory barriers prevent a CPU from performing many of the techniques it uses to hide memory latency, so they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimise the units of work without restriction. There is an advantage to grouping necessary memory barriers: buffers flushed after the first one will be less costly because no work will be under way to refill them.
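The idea of placing barriers on work-unit boundaries can be sketched as follows. This is an illustration only: the class `BatchedPublishDemo` and its members are hypothetical, and it assumes a single writer thread. The inner loop works on a local variable, which the compiler and CPU are free to optimise, and the volatile write happens once per batch.

```java
final class BatchedPublishDemo {
    private volatile long published = 0L; // running total visible to other threads
    private int volatileWrites = 0;       // bookkeeping for this sketch, writer thread only

    /** Sums the items, publishing the running total once per batch rather than per item. */
    void process(long[] items, int batchSize) {
        long localTotal = published;
        int i = 0;
        while (i < items.length) {
            int end = Math.min(i + batchSize, items.length);
            for (; i < end; i++) {
                localTotal += items[i]; // plain work inside the unit, no barriers
            }
            published = localTotal;     // one volatile store on the unit boundary
            volatileWrites++;
        }
    }

    long published() {
        return published;
    }

    int volatileWrites() {
        return volatileWrites;
    }

    public static void main(String[] args) {
        BatchedPublishDemo demo = new BatchedPublishDemo();
        demo.process(new long[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 4);
        System.out.println(demo.published() + " published in " + demo.volatileWrites() + " volatile writes");
    }
}
```

The trade-off is latency: readers see progress only at batch boundaries, so the batch size should reflect how stale a reader can tolerate the published value being.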
