We introduce “deep compression”, a three-stage pipeline: pruning, trained quantization, and Huffman coding. The stages work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy.
Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman coding to the quantized weights.
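The final stage exploits the skewed distribution of the quantized weight indices: frequent values get short codes. A minimal Huffman-coding sketch (not the paper's implementation; function name and structure are my own) could look like:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table from a sequence of symbols (e.g. the
    quantized weight indices), assigning shorter codes to frequent symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single symbol gets code "0"
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least-frequent subtrees, prefixing their codes.
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]
```

Since the codes are prefix-free, the index stream can be concatenated and still decoded unambiguously.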
Network Pruning: We start by learning the connectivity via normal network training. Next, we prune the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, we retrain the network to learn the final weights for the remaining sparse connections. Pruning reduces the number of parameters by 9× for AlexNet and 13× for the VGG-16 model.
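The prune-then-retrain step can be sketched with a binary mask that zeroes small weights and keeps them at zero during the retraining updates (a minimal sketch, assuming numpy arrays; the function names and the plain-SGD update are my own, not the paper's code):

```python
import numpy as np

def prune_by_threshold(weights, threshold):
    """Remove connections whose magnitude is below the threshold:
    returns the pruned weights and a binary mask for retraining."""
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

def masked_update(weights, grad, mask, lr=0.01):
    """One hypothetical retraining step: pruned connections stay at zero
    because the mask is reapplied after the gradient update."""
    return (weights - lr * grad) * mask
```

In practice the threshold is chosen per layer (e.g. as a multiple of the layer's weight standard deviation) to hit a target sparsity.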
Weight Sharing: We use k-means clustering to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster share the same value.
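A minimal 1-D k-means sketch of this quantization step (my own illustration, not the paper's code; the paper additionally fine-tunes the shared centroids and compares several centroid initializations):

```python
import numpy as np

def share_weights(weights, n_clusters=4, n_iters=20):
    """Cluster a layer's weights with 1-D k-means and replace every weight
    by its cluster centroid, so only n_clusters distinct values remain."""
    flat = weights.ravel()
    # Initialize centroids linearly over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign each weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # Recompute each centroid as the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return centroids[idx].reshape(weights.shape), idx.reshape(weights.shape)
```

After this step, the layer stores only the small codebook of centroids plus a low-bit cluster index per weight, which is what the Huffman stage then compresses.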
Song Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, ICLR 2016.