Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators

We propose a lightweight scheme that changes how a data block is formed so that it tolerates soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized to the range [-1, 1] after each convolutional layer, which leaves one bit unused in the half-precision floating-point representation. We exploit this unused bit to store a backup copy of the most significant bit, protecting it against soft errors. In addition, because in MLC STT-RAM both the cost of memory operations (read and write) and the reliability of a cell are content-dependent (some bit patterns require larger current and longer switching time, and are also more susceptible to soft errors), we rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques delivers the same accuracy as an error-free baseline while reducing read and write energy by 9% and 6%, respectively.
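As an illustration of the bit-level idea (a minimal sketch, not the paper's implementation), the Python snippet below assumes the protected most significant bit is the FP16 sign bit and the unused bit is the exponent MSB, which is always 0 for values in [-1, 1]; the helper names `encode`/`decode` are hypothetical.

```python
import struct

SIGN_BIT = 15  # MSB of an FP16 word: the sign bit we want to protect
EXP_MSB = 14   # exponent MSB: always 0 for weights in [-1, 1], so it is free

def fp16_bits(w: float) -> int:
    """Return the half-precision bit pattern of w as a 16-bit integer."""
    return struct.unpack('<H', struct.pack('<e', w))[0]

def bits_to_fp16(bits: int) -> float:
    """Interpret a 16-bit integer as a half-precision value."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def encode(w: float) -> int:
    """Before storing, copy the sign bit into the unused exponent MSB."""
    bits = fp16_bits(w)
    sign = (bits >> SIGN_BIT) & 1
    return bits | (sign << EXP_MSB)

def decode(stored: int) -> float:
    """On a read, rebuild the sign from its backup and clear the borrowed bit."""
    sign_backup = (stored >> EXP_MSB) & 1
    restored = (stored & ~((1 << SIGN_BIT) | (1 << EXP_MSB))) | (sign_backup << SIGN_BIT)
    return bits_to_fp16(restored)

# Example: a flipped sign bit is recovered from its backup copy.
w = -0.625
corrupted = encode(w) ^ (1 << SIGN_BIT)   # simulate a soft error on the MSB
assert decode(corrupted) == w
```

Trusting the backup unconditionally in `decode` is a simplification; it models the assumption that, after the block rearrangement described above, the backup copy lands in a more reliable cell position than the original MSB.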
