Long Live TIME: Improving Lifetime for Training-In-Memory Engines by Structured Gradient Sparsification

Deeper and larger neural networks (NNs) have driven breakthroughs in many fields, but conventional CMOS-based computing platforms struggle to deliver the required energy efficiency. RRAM-based systems offer a promising alternative for building efficient Training-In-Memory Engines (TIME). However, the endurance of RRAM cells is limited, which is a severe issue because NN weights must be updated thousands to millions of times during training. Gradient sparsification can mitigate this problem by dropping most of the smaller gradients, but it introduces unacceptable computation cost. We propose an effective framework, SGS-ARS, which combines Structured Gradient Sparsification (SGS) with an Aging-aware Row Swapping (ARS) scheme to balance writes across entire RRAM crossbars and prolong the lifetime of TIME. Our experiments demonstrate a 356× lifetime extension when TIME is programmed to train ResNet-50 on the ImageNet dataset with our SGS-ARS framework.
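To make the two mechanisms concrete, below is a minimal Python sketch of how row-wise gradient sparsification and aging-aware row swapping might interact on a single crossbar. This is an illustration under our own assumptions (row-granular writes, an L1-norm importance score, a simple write-count threshold, and hypothetical names such as `aging_aware_row_swap`), not the paper's actual implementation.

```python
import numpy as np

def structured_gradient_sparsification(grad, keep_ratio=0.1):
    """Select the gradient rows with the largest L1 norms.
    Row granularity is assumed to match the write unit of an RRAM
    crossbar, so a whole row is either written or skipped."""
    scores = np.abs(grad).sum(axis=1)            # importance of each row
    n_keep = max(1, int(keep_ratio * grad.shape[0]))
    return np.argsort(scores)[-n_keep:]          # indices of rows to write

def aging_aware_row_swap(cells, writes, row_map, threshold=100):
    """If the most-written physical row exceeds the least-written one
    by `threshold`, swap their contents and update the logical-to-
    physical map so future updates to the hot row hit fresher cells."""
    hot, cold = int(np.argmax(writes)), int(np.argmin(writes))
    if writes[hot] - writes[cold] > threshold:
        cells[[hot, cold]] = cells[[cold, hot]]
        writes[[hot, cold]] += 1                 # the swap itself writes both rows
        # keep logical rows pointing at the right physical rows
        lh = int(np.where(row_map == hot)[0][0])
        lc = int(np.where(row_map == cold)[0][0])
        row_map[lh], row_map[lc] = cold, hot
    return cells, writes, row_map

# One simulated training step on a 128x128 crossbar.
rng = np.random.default_rng(0)
cells = rng.standard_normal((128, 128))          # physical RRAM array
writes = np.zeros(128, dtype=int)                # per-row write counters
row_map = np.arange(128)                         # logical row -> physical row

grad = rng.standard_normal((128, 128))
rows = structured_gradient_sparsification(grad)
phys = row_map[rows]                             # resolve physical locations
cells[phys] -= 0.01 * grad[rows]                 # only selected rows are written
writes[phys] += 1
cells, writes, row_map = aging_aware_row_swap(cells, writes, row_map)
```

Selecting whole rows rather than individual elements is what keeps the bookkeeping cheap in this sketch: one counter and one map entry per row suffice, which is also why a row-level swap can spread wear across the entire crossbar.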
