Power- and Endurance-Aware Neural Network Training in NVM-Based Platforms

Neural networks (NNs) have become the go-to tool for solving many real-world recognition and classification tasks with massive and complex data sets. These networks require large data sets for training, which is usually performed on GPUs and CPUs in either a cloud or edge computing setting. No matter where the training is performed, it is subject to tight power/energy and data storage/transfer constraints. While these issues can be mitigated by replacing SRAM/DRAM with nonvolatile memories (NVMs), which offer near-zero leakage power and high scalability, the massive weight updates performed during training shorten NVM lifetime and incur high write energy. In this paper, an NVM-friendly NN training approach is proposed. The weight update is redesigned to reduce bit flips in NVM cells. Moreover, two techniques, namely filter exchange and bitwise rotation, are proposed to balance writes across different weights and across the bits of a single weight, respectively. The proposed techniques are integrated and evaluated in Caffe. Experimental results show significant power savings and endurance improvements while maintaining high inference accuracy.
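To make the two balancing techniques concrete, below is a minimal Python sketch, assuming 8-bit quantized weights, a per-word write counter driving the rotation, and per-filter write counters driving the exchange. All names and interfaces here (`write_rotated`, `exchange_hot_filters`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

WORD_BITS = 8  # assumed weight width; the paper's quantization may differ


def bit_flips(old: int, new: int) -> int:
    """Number of NVM cells that must flip to overwrite `old` with `new`."""
    return bin((old ^ new) & ((1 << WORD_BITS) - 1)).count("1")


def rotate_left(value: int, shift: int) -> int:
    """Rotate an unsigned WORD_BITS-wide value left by `shift` bits."""
    shift %= WORD_BITS
    mask = (1 << WORD_BITS) - 1
    return ((value << shift) | (value >> (WORD_BITS - shift))) & mask


def write_rotated(stored: int, new_value: int, write_count: int):
    """Bitwise rotation: store the new value rotated by the per-word
    write counter, so frequently toggled low-order bits are not always
    mapped onto the same physical cells across successive writes."""
    rotated = rotate_left(new_value, write_count % WORD_BITS)
    return rotated, bit_flips(stored, rotated)


def exchange_hot_filters(filters: np.ndarray, writes: np.ndarray):
    """Filter exchange: swap the storage slots of the most- and
    least-written filters so write traffic is balanced across weights."""
    hot, cold = int(writes.argmax()), int(writes.argmin())
    filters[[hot, cold]] = filters[[cold, hot]]
    writes[[hot, cold]] = writes[[cold, hot]]
    return filters, writes


if __name__ == "__main__":
    # Rotation example: the stored bit pattern changes with the write count.
    stored, flips = write_rotated(stored=0b00010110,
                                  new_value=0b00010111,
                                  write_count=5)
    print(f"stored pattern {stored:08b}, {flips} bit flips")

    # Exchange example: hypothetical per-filter write counters.
    rng = np.random.default_rng(0)
    filters = rng.integers(0, 256, size=(4, 3, 3), dtype=np.uint8)
    writes = np.array([120, 15, 60, 30])
    exchange_hot_filters(filters, writes)
```

Note that rotation does not reduce the flip count of any single write; its purpose is wear leveling, spreading flips evenly across the bits of a word so no one cell exhausts its endurance first.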
