Zero Aware Configurable Data Encoding by Skipping Transfer for Error Resilient Applications

Data transfer across DRAM channels accounts for nearly a quarter of the total energy consumption of DDR4 DRAMs, and modern applications with high bandwidth requirements further increase channel energy consumption. Channel energy, however, depends on the data being transferred: the Pseudo Open Drain (POD) asymmetric termination used in current DDR4 systems consumes termination energy only when 1s are transmitted over the channel. Many modern applications, including AI/ML workloads, are resilient to errors in data and can work well with approximate data. This resilience can vary widely across and within applications, which opens a number of opportunities to exploit these characteristics to save data transfer energy across the DRAM channel. Existing DRAM data encoding schemes, however, target applications that require exact data and do not exploit approximation resilience. In this paper, we propose Zero Aware Configurable Data Encoding by Skipping Transfer (ZAC-DEST), a data encoding scheme that reduces the energy consumption of DRAM channels and is specifically targeted at approximate computing and error-resilient applications. ZAC-DEST exploits the similarity between recent data transfers across channels, together with information about the error resilience behaviour of applications, to reduce on-die termination and switching energy by reducing the number of 1s transmitted over the channels. ZAC-DEST also provides a number of knobs for trading off an application's accuracy for energy savings, and vice versa, and can be applied to both training and inference. We apply ZAC-DEST to five machine learning applications. On average, across all applications and configurations, we observe a 40% reduction in termination energy and a 37% reduction in switching energy compared to the state-of-the-art data encoding technique BD-Coder, with an average output quality loss of 10%. We show that if both training and testing are performed in the presence of ZAC-DEST, the output quality of the applications can be improved by up to 9× compared to applying ZAC-DEST only during testing, yielding energy savings during both training and inference with increased output quality.
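The abstract only sketches the mechanism at a high level: skip transmitting words that closely resemble a recently transferred word, drive the lane with zeros (which cost no termination energy under POD), and let the receiver reuse its stale copy, with the tolerated mismatch acting as the accuracy/energy knob. The minimal Python sketch below illustrates that general idea; the 64-bit word width, the per-word skip flag, the Hamming-distance skip rule, and the `SKIP_THRESHOLD` knob are all illustrative assumptions, not the paper's actual encoder.

```python
SKIP_THRESHOLD = 4  # hypothetical accuracy/energy knob: max differing bits to allow a skip

def ones(word: int) -> int:
    """Population count of a 64-bit word. With POD termination, only the 1s
    driven on the wire dissipate termination energy, so this is the cost proxy."""
    return bin(word & (2**64 - 1)).count("1")

def encode_burst(burst, prev_burst):
    """One-burst sketch: for each word, if it differs from the word sent in the
    same lane position of the previous burst by at most SKIP_THRESHOLD bits,
    drive the lane with all zeros (free under POD) and raise a skip flag so the
    receiver reuses its stored copy. The reused stale word is the approximation
    error the application must tolerate."""
    wire, flags, wire_ones = [], [], 0
    for cur, prev in zip(burst, prev_burst):
        if ones(cur ^ prev) <= SKIP_THRESHOLD:  # similar to the last transfer
            wire.append(0)                      # all-zero lane: no 1s sent
            flags.append(1)                     # 1-bit sideband skip flag
        else:
            wire.append(cur)                    # too different: send exactly
            flags.append(0)
            wire_ones += ones(cur)
    wire_ones += sum(flags)                     # the flag bits cost energy too
    return wire, flags, wire_ones

def decode_burst(wire, flags, prev_burst):
    """Receiver side: substitute the previous burst's word wherever the skip
    flag is set; otherwise take the word off the wire."""
    return [prev if f else w for w, f, prev in zip(wire, flags, prev_burst)]

# Example: two of the four words are close enough to be skipped.
prev = [0x00FF, 0x1234, 0x0000, 0xFFFF]
cur  = [0x00FE, 0x1234, 0x0001, 0x0F0F]
wire, flags, cost = encode_burst(cur, prev)
approx = decode_burst(wire, flags, prev)  # approx != cur where flags are set
```

Setting `SKIP_THRESHOLD` to 0 skips only exact repeats and is lossless; raising it trades output quality for fewer transmitted 1s, mirroring the configurable accuracy-versus-energy knobs the abstract describes.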
