Deep Compression

Approach

We introduce “deep compression”, a three-stage pipeline of pruning, trained quantization, and Huffman coding. The three stages work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy.

Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing. Finally, we apply Huffman coding.
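To make the last stage concrete, here is a minimal sketch of Huffman coding, which gives shorter codes to symbols that occur more often. It assumes the symbols being coded are the cluster indices produced by quantization. The function name `huffman_code_lengths` and the toy index list are hypothetical and only illustrate the idea; they are not the authors' implementation.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths

# Hypothetical quantized layer: 16 weights mapped to 4 cluster indices,
# with a skewed distribution.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
counts = Counter(indices)
lengths = huffman_code_lengths(indices)

huffman_bits = sum(lengths[s] * counts[s] for s in counts)
fixed_bits = len(indices) * 2  # 2 bits per index with a fixed-length code

print(lengths)                   # {0: 1, 1: 2, 2: 3, 3: 3}
print(huffman_bits, fixed_bits)  # 28 32
```

In this toy case the skewed index distribution lets Huffman coding store the indices in 28 bits instead of 32. The gain depends entirely on how non-uniform the distribution is.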


  • Network pruning: We start by learning the connectivity via normal network training. Next, we prune the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, we retrain the network to learn the final weights for the remaining sparse connections. Pruning reduced the number of parameters by 9× for AlexNet and 13× for VGG-16.

  • Weight sharing: We use k-means clustering to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster share the same value.
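The two steps above can be sketched in plain Python for a single layer stored as a flat list of weights. This is a simplified illustration under several assumptions: "below a threshold" is read as a comparison on the weight's magnitude, the centroids are initialized linearly between the smallest and largest surviving weight, and the retraining step is left out. The names `prune`, `kmeans_1d`, and `share_weights` are hypothetical, as are the threshold 0.5 and the choice of k = 4.

```python
import random

def prune(weights, threshold):
    """Set weights whose magnitude is below `threshold` to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means (k >= 2); returns (centroids, cluster index per value)."""
    lo, hi = min(values), max(values)
    # Linear initialization between min and max: an assumption of this sketch.
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - centroids[j]))

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]

def share_weights(pruned, k):
    """Replace each surviving weight with the centroid of its cluster."""
    kept = [i for i, w in enumerate(pruned) if w != 0.0]
    centroids, assign = kmeans_1d([pruned[i] for i in kept], k)
    shared = [0.0] * len(pruned)
    for i, a in zip(kept, assign):
        shared[i] = centroids[a]
    return shared, centroids

random.seed(0)
layer = [random.gauss(0.0, 1.0) for _ in range(64)]  # toy "layer"
pruned = prune(layer, threshold=0.5)
shared, centroids = share_weights(pruned, k=4)

n_kept = sum(1 for w in pruned if w != 0.0)
print(f"kept {n_kept} of {len(layer)} weights")
print("shared values:", sorted(centroids))
```

After this step each surviving connection only needs to store a small cluster index (2 bits for k = 4) plus one shared table of centroid values per layer, instead of a full-precision weight per connection.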

Experiment

References:
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Song Han et al., 2016, ICLR
