1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs

1. Abstract

  • The gradients can be quantized aggressively, to just one bit per value, at no or nearly no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback): the error made when quantizing one minibatch's gradient is added to the next minibatch's gradient.
  • This size reduction makes it feasible to parallelize SGD through data parallelism even with fast processors such as recent GPUs.
  • The finding is combined with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization interacts well with AdaGrad and gives a small accuracy gain.

2. Intro

  • In data parallelism, each minibatch is split across nodes, and each node computes a sub-gradient on its sub-minibatch. These sub-gradients, which have the same dimension as the full model, must be summed over all nodes and redistributed (see the sketch after this list).
  • Applied directly to typical training configurations, this is infeasible because of the high bandwidth required to exchange sub-minibatch gradients across nodes.
  • The avenues for improving data-parallel efficiency are increasing the minibatch size and reducing the amount of data that gets exchanged.
  • The paper proposes to reduce bandwidth by aggressively quantizing the sub-gradients, to just one bit per value. This does not, or almost does not, reduce word accuracy, but only if the quantization error is carried forward across minibatches, i.e. the error made when quantizing the gradient of one minibatch is added (fed back) to the gradient of the next minibatch.
  • This form of data parallelism does not change the convergence behavior, i.e. the objective as a function of training progress. It also differs from Hogwild/ASGD: the paper focuses on a deterministic convergence behavior.
  • Within this category, the alternative to data parallelism is model parallelism, where the model is distributed over nodes. One can also parallelize over layers [19]: each GPU processes one or more consecutive layers, and data flows up and down through the layers between GPUs.
  • That work showed, however, that delayed updates can work, and it motivated the double-buffering technique applied in this paper.
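As a rough illustration of the exchange described in the first bullet, here is a minimal single-process sketch of one synchronous data-parallel step (no quantization yet); `grad_fn`, `sub_minibatches`, and the NumPy-based reduction are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def data_parallel_step(model, sub_minibatches, grad_fn, lr=0.01):
    """One synchronous data-parallel SGD step (no quantization yet)."""
    # Each 'node' computes a full-dimension sub-gradient on its own data split.
    sub_grads = [grad_fn(model, sb) for sb in sub_minibatches]
    # Summing over all nodes is the bandwidth-heavy exchange that
    # 1-bit quantization is meant to compress.
    total_grad = np.sum(sub_grads, axis=0)
    # Redistribute: every replica applies the same aggregated update.
    return model - lr * total_grad
```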

3. Data-parallel Deterministically Distributed SGD

  • CD-DNN-HMM model

3.1. Data-parallel Distributed SGD

  • The optimal number of nodes is reached when computation and communication fully overlap, i.e. when the compute time equals the communication time (see the cost sketch after this list).
  • The time per minibatch breaks down into four parts: the computation per frame (three matrix products per layer); gradient post-processing (momentum, AdaGrad), which is a component-wise operation; the time to exchange the floating-point sub-gradients; and the fixed cost of applying the aggregated gradient to the model parameters, also a component-wise operation.
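The trade-off can be written as a small cost model. The names below (`t_calc`, `t_post`, `t_fix`, etc.) are my own notation for the four components above, not the paper's formula, and the model assumes communication is overlapped with computation.

```python
def minibatch_time(K, N, t_calc, t_post, grad_bytes, bandwidth, t_fix):
    """Rough per-minibatch wall-clock model for K nodes.

    N          : minibatch size in frames
    t_calc     : compute time per frame (the per-layer matrix products)
    t_post     : component-wise gradient post-processing (momentum, AdaGrad)
    grad_bytes : size of one sub-gradient exchange
    bandwidth  : usable network bandwidth per node
    t_fix      : fixed cost of applying the update to the model
    """
    compute = (N / K) * t_calc         # shrinks as nodes are added
    comm = grad_bytes / bandwidth      # roughly independent of K
    # With overlap, a step is bounded by the larger of the two terms;
    # the best K is roughly where compute and communication are equal.
    return max(compute, comm) + t_post + t_fix
```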

3.2. Double Buffering with Half Batches

  • To overlap computation with communication, each minibatch is broken in half: the sub-gradients of one half-minibatch are exchanged while the sub-gradients of the next half-minibatch are being computed (see the sketch after this list).
  • Each half therefore uses a model that is outdated by N/2 samples (a delayed update [19, 8]); this fixed, small delay does not fundamentally affect convergence.
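A minimal sketch of the double-buffering loop, assuming hypothetical `grad_fn`, `exchange_fn`, and `apply_fn` helpers, with a background thread standing in for the overlapped communication.

```python
from concurrent.futures import ThreadPoolExecutor

def double_buffered_epoch(model, half_batches, grad_fn, exchange_fn, apply_fn):
    """Exchange one half-minibatch's sub-gradient while computing the next.

    The gradient of each half is computed with a model that is outdated
    by N/2 samples, i.e. the fixed small delay described above.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # exchange of the previous half, running in the background
        for half in half_batches:
            sub_grad = grad_fn(model, half)                 # compute current half
            if pending is not None:
                model = apply_fn(model, pending.result())   # finish previous exchange
            pending = pool.submit(exchange_fn, sub_grad)    # start this half's exchange
        if pending is not None:
            model = apply_fn(model, pending.result())
    return model
```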

3.3. Potential Faster-Than-Fixed-Cost Communication

  • When the communication cost drops below the fixed cost, the network is no longer saturated and speed is limited by the fixed cost instead; this is what happens with 1-bit SGD.
  • In this case, double buffering with half-minibatches no longer makes sense: it masks the communication cost at the expense of an additional fixed cost, and the fixed cost is now the larger term.

3.4. Relation to Hogwild/ASGD

  • Hogwild/ASGD differs in that it uses an unsynchronized gradient exchange (via a parameter server). It is another form of delayed update, one in which the delay varies non-deterministically across model parameters.

4. 1-bit SGD with Error Feedback

  • Quantization error is unavoidable and can lead to divergence if ignored.
  • Following the idea of Sigma-Delta modulation, when a parameter's gradient is quantized, the quantization error is saved and added to that parameter's gradient in the next minibatch before the next quantization.
  • As long as error feedback is used, the gradients can be quantized all the way down to 1 bit at no or nearly no loss of accuracy.
  • For the 1-bit implementation, a constant quantization threshold of 0 is a good (and cheap) choice, whereas the reconstruction values used by the unquantizer Q⁻¹(·) are tied within each weight-matrix column (j, l). The two values per column are recomputed so as to minimize the squared quantization error and are transmitted in each data exchange (see the sketch after this list).
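The quantize-with-error-feedback step can be sketched as follows. This is an illustration of the idea in NumPy, not the paper's code: it assumes a (rows × cols) gradient matrix, a threshold of 0, and per-column reconstruction values taken as the column means of the negative and non-negative entries, which minimize the squared error within each bin.

```python
import numpy as np

def quantize_1bit_with_error_feedback(grad, error):
    """1-bit quantization with error feedback (illustrative sketch).

    grad, error : (rows, cols) arrays; 'error' is the quantization error
    carried over from the previous minibatch.
    Returns (bits, lo, hi, new_error), where 'bits' marks entries >= 0 and
    (lo, hi) are the per-column reconstruction values.
    """
    g = grad + error                               # feed back previous error
    bits = g >= 0.0                                # constant threshold of 0
    pos_count = np.maximum(bits.sum(axis=0), 1)
    neg_count = np.maximum((~bits).sum(axis=0), 1)
    hi = (g * bits).sum(axis=0) / pos_count        # column mean of entries >= 0
    lo = (g * ~bits).sum(axis=0) / neg_count       # column mean of entries < 0
    reconstructed = np.where(bits, hi, lo)         # what the receiver reconstructs
    new_error = g - reconstructed                  # carried to the next minibatch
    return bits, lo, hi, new_error
```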

4.1. Aggregating the Gradients

  • Each compute node is responsible for aggregating a 1/K-th subset of the model parameters, which it receives in quantized form from all peer nodes.
  • These quantized values are unquantized and summed, post-processed (AdaGrad, momentum), quantized a second time, and redistributed to the compute nodes; each minibatch's gradient is therefore quantized twice (see the sketch after this list).
  • The first quantization is applied to the sub-gradients, which are then summed, so the quantization error is reduced through averaging. The second quantization happens after AdaGrad, where the gradient values lie in a more homogeneous numeric range.
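An illustrative outline of what one aggregation node does for its parameter stripe, with hypothetical `unquantize`, `postprocess`, and `quantize` helpers standing in for the steps named above.

```python
def aggregate_stripe(stripe_packets, unquantize, postprocess, quantize):
    """Aggregate one 1/K-th stripe of the model (double-quantization flow).

    stripe_packets : quantized sub-gradient stripes received from all K peers
    """
    # The first quantization happened on the senders; summing the
    # reconstructed stripes reduces that error through averaging.
    total = sum(unquantize(p) for p in stripe_packets)
    # Post-process the aggregated stripe (AdaGrad, momentum) so that the
    # second quantization sees values in a more homogeneous numeric range.
    total = postprocess(total)
    # Second quantization, before redistributing the stripe to all peers.
    return quantize(total)
```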

5. System Description

  • The first 45 minutes of data are used to automatically select a suitable minibatch size.
  • The learning rate is decayed based on accuracy on a cross-validation set.
  • AdaGrad is used to normalize the gradients per dimension according to how each dimension varies over time (see the sketch after this list).
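A minimal sketch of the per-dimension normalization AdaGrad performs; the accumulator and epsilon below follow the textbook formulation, and the paper's exact variant and scaling may differ.

```python
import numpy as np

class AdaGrad:
    """Normalize each gradient dimension by the accumulated magnitude
    of its history (textbook form, illustrative settings)."""

    def __init__(self, shape, eps=1e-8):
        self.accum = np.zeros(shape)  # running sum of squared gradients
        self.eps = eps

    def normalize(self, grad):
        self.accum += grad ** 2
        return grad / (np.sqrt(self.accum) + self.eps)
```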

6. Experimental Results

6.1. Cost Measurements

  • The costs that do not depend on the minibatch size, i.e. gradient post-processing (AdaGrad, momentum) and the fixed cost of the model update, are essentially constant.

6.2. Effect of 1-bit Quantization

  • 1-bit quantization works well across all setups, with a minor but consistent impact on training-set frame accuracy.
  • Double buffering has only a minor impact on accuracy.

7. Conclusion

  • 1-bit quantization significantly reduces the data-exchange bandwidth of data-parallel SGD at no or nearly no loss of accuracy, making data-parallel distribution of SGD feasible even with modern fast hardware (GPUs).
  • For this to work, quantization-error feedback is essential.
  • Quantization and AdaGrad interact with each other; unexpectedly, the interaction is beneficial.