O-2A: Low Overhead DNN Compression with Outlier-Aware Approximation

We present Outlier-Aware Approximation (O-2A) coding, a low-latency DNN compression technique that reduces DRAM energy, a significant component of DNN inference cost. The technique compresses 8-bit integers, the de facto standard of DNN inference, to 6 bits without degrading network accuracy. Because its hardware overhead is small, the O-2A codec can easily be embedded in a DRAM controller. On an Eyeriss platform, O-2A coding improves both DRAM energy and system performance by 18~20%. O-2A coding also enables an error-correction scheme without additional parity overhead, opening the possibility of an approximate DRAM that simultaneously reduces DRAM access and refresh energy.
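The abstract does not spell out the O-2A bit layout, so the following is a minimal sketch under assumed parameters: each signed 8-bit value maps to a 6-bit code consisting of one flag bit plus five payload bits; "inlier" values that fit in five signed bits are stored exactly, while "outlier" values are approximated by discarding their three least-significant bits. The names o2a_encode/o2a_decode and the 1+5-bit format are hypothetical illustrations, not the paper's actual codec.

    # Sketch of an outlier-aware 8-bit -> 6-bit code (ASSUMED layout:
    # 1 flag bit + 5 payload bits; not the paper's exact scheme).

    def o2a_encode(x: int) -> int:
        """Encode a signed 8-bit integer (-128..127) into a 6-bit code."""
        assert -128 <= x <= 127
        if -16 <= x <= 15:                   # inlier: fits in 5 signed bits
            return (0 << 5) | (x & 0x1F)     # flag=0, exact 5-bit payload
        # outlier: keep the 5 most-significant bits, drop the bottom 3
        return (1 << 5) | ((x >> 3) & 0x1F)  # flag=1, approximate payload

    def o2a_decode(code: int) -> int:
        """Decode a 6-bit code back to a signed 8-bit integer."""
        flag, payload = code >> 5, code & 0x1F
        value = payload - 32 if payload & 0x10 else payload  # sign-extend
        return value if flag == 0 else value << 3            # restore scale

    if __name__ == "__main__":
        for x in (0, 7, -16, 100, -90):
            code = o2a_encode(x)
            print(f"{x:4d} -> code {code:06b} -> {o2a_decode(code):4d}")

Under this assumed layout, inliers round-trip exactly and outliers incur a bounded error of at most 7 (the discarded low-order bits), which matches the abstract's premise that small approximation of infrequent outliers leaves DNN accuracy intact.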
