Enabling Scientific Computing on Memristive Accelerators

Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the 8- to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing, where high-precision floating point is the norm. This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored: reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling. Together, they enable a fixed-point memristive accelerator to perform high-precision floating-point computation without the exorbitant cost of naïve floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that the memristive accelerator cannot handle efficiently on its own. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, improving the execution time and energy dissipation of sparse linear solvers by 10.3x and 10.9x, respectively, over a purely GPU-based system.
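
To make the idea of exponent range locality concrete, the following is a minimal software sketch, not the accelerator's actual datapath. It assumes that a block of double-precision values can be aligned to one shared exponent, multiplied as integers (standing in for the fixed-point crossbar), and rescaled at the end. The function names `align_block` and `fixed_point_mvm` are hypothetical, and the early-termination and scheduling techniques are not modeled.

```python
import math

def align_block(values):
    """Align floats to a shared exponent (illustrative helper, not from the paper).

    Returns (ints, scale_exp, width) such that each value equals
    ints[i] * 2**scale_exp exactly, and `width` is the number of magnitude
    bits needed: 53 mantissa bits plus the exponent range of the block.
    """
    exps = [math.frexp(v)[1] for v in values if v != 0.0]
    if not exps:
        return [0] * len(values), 0, 0
    min_exp, max_exp = min(exps), max(exps)
    # Shifting by the smallest exponent keeps every mantissa bit, so the
    # conversion is exact (assuming the exponent range stays well below ~970).
    scale_exp = min_exp - 53
    ints = [int(math.ldexp(v, -scale_exp)) for v in values]
    return ints, scale_exp, 53 + (max_exp - min_exp)

def fixed_point_mvm(matrix, vector):
    """Matrix-vector product using integer arithmetic on aligned operands."""
    n = len(vector)
    flat = [a for row in matrix for a in row]
    m_ints, m_exp, _ = align_block(flat)
    v_ints, v_exp, _ = align_block(vector)
    result = []
    for r in range(len(matrix)):
        # Exact integer dot product, as a fixed-point unit would accumulate it.
        acc = sum(m_ints[r * n + c] * v_ints[c] for c in range(n))
        # One rounding step when converting back to floating point.
        result.append(math.ldexp(acc, m_exp + v_exp))
    return result

y = fixed_point_mvm([[1.5, 2.0], [0.25, -4.0]], [2.0, 0.5])
narrow = align_block([1.0, 1.5])[2]     # values share one exponent
wide = align_block([1.0, 1024.0])[2]    # exponents differ by 10
```

In this sketch the required fixed-point width grows with the spread of exponents inside a block, so blocks whose values have similar magnitudes need few bits beyond the mantissa. That is the intuition behind the overhead reduction, under the assumptions stated above.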
