论文信息 - Enabling Scientific Computing on Memristive Accelerators

Enabling Scientific Computing on Memristive Accelerators

Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the eight-to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing where high-precision floating point is the norm. This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored — reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling — that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naïve floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that cannot be efficiently handled by the memristive accelerator alone. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, respectively improving the execution time and energy dissipation of sparse linear solvers by 10.3x and 10.9x over a purely GPU-based system.

[1] Richard Vuduc,et al. Automatic performance tuning of sparse matrix kernels , 2003 .

[2] Z. Wei,et al. Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism , 2008, 2008 IEEE International Electron Devices Meeting.

[3] Engin Ipek,et al. Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning , 2017 .

[4] Yusuf Leblebici,et al. A 3.1 mW 8b 1.2 GS/s Single-Channel Asynchronous SAR ADC With Alternate Comparators for Enhanced Speed in 32 nm Digital SOI CMOS , 2013, IEEE Journal of Solid-State Circuits.

[5] Eric S. Chung,et al. Towards a Universal FPGA Matrix-Vector Multiplication Architecture , 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.

[6] Florin Pop. High Performance Numerical Computing for High Energy Physics: A New Challenge for Big Data Science , 2014 .

[7] V. Springel,et al. Properties of galaxies reproduced by a hydrodynamic simulation , 2014, Nature.

[8] M. Hestenes,et al. Methods of conjugate gradients for solving linear systems , 1952 .

[9] Cong Xu,et al. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost , 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[10] William Rhett Davis,et al. FreePDK15: An Open-Source Predictive Process Design Kit for 15nm FinFET Technology , 2015, ISPD.

[11] Dejan Markovic,et al. A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs , 2014, FPGA.

[12] Yiran Chen,et al. GraphR: Accelerating Graph Processing Using ReRAM , 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[13] R. Pielke. Mesoscale Meteorological Modeling , 1984 .

[14] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[15] Bernhard Schölkopf,et al. Kernel Methods in Computational Biology , 2005 .

[16] Paul Messina,et al. The Exascale Computing Project , 2017, Comput. Sci. Eng..

[17] Nam Sung Kim,et al. GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[18] Tao Zhang,et al. Overcoming the challenges of crossbar resistive memory architectures , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[19] J. Yang,et al. High switching endurance in TaOx memristive devices , 2010 .

[20] Chung-Wei Hsu,et al. Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 1012 cycles for 3D high-density storage-class memory , 2013, 2013 Symposium on VLSI Technology.

[21] Lilia Maliar,et al. Numerical Methods for Large-Scale Dynamic Economic Models , 2014 .

[22] Jack J. Dongarra,et al. The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..

[23] Chris Yakopcic,et al. Model for maximum crossbar size based on input driver impedance , 2016 .

[24] Henk A. van der Vorst,et al. Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems , 1992, SIAM J. Sci. Comput..

[25] R. Sarpeshkar,et al. A 10-nW 12-bit accurate analog storage cell with 10-aA leakage , 2004, IEEE Journal of Solid-State Circuits.

[26] James Demmel,et al. IEEE Standard for Floating-Point Arithmetic , 2008 .

[27] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[28] Y. Saad,et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .

[29] Franz Franchetti,et al. Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware , 2013, 2013 IEEE High Performance Extreme Computing Conference (HPEC).

[30] Gokcen Kestor,et al. Quantifying the energy cost of data movement in scientific applications , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[31] Jia Wang,et al. DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[32] David H. Bailey,et al. High-precision floating-point arithmetic in scientific computation , 2004, Computing in Science & Engineering.

[33] Wouter A. Serdijn,et al. Analysis of Power Consumption and Linearity in Capacitive Digital-to-Analog Converters Used in Successive Approximation ADCs , 2011, IEEE Transactions on Circuits and Systems I: Regular Papers.

[34] Catherine Graves,et al. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[35] Jack J. Dongarra,et al. Efficiency of General Krylov Methods on GPUs -- An Experimental Study , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[36] Engin Ipek,et al. Making Memristive Neural Network Accelerators Reliable , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[37] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .

[38] Mayler G. A. Martins,et al. Open Cell Library in 15nm FreePDK Technology , 2015, ISPD.

[39] Ligang Gao,et al. High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm , 2011, Nanotechnology.

[40] Miao Hu,et al. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[41] Mark Horowitz,et al. FPU Generator for Design Space Exploration , 2013, 2013 IEEE 21st Symposium on Computer Arithmetic.

[42] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.

[43] Shyh-Chyi Wong,et al. Modeling of interconnect capacitance, delay, and crosstalk in VLSI , 2000 .

[44] Andrew B. Kahng,et al. CACTI 7 , 2017, ACM Trans. Archit. Code Optim..

[45] L. V. Allis,et al. Searching for solutions in games and artificial intelligence , 1994 .

[46] Mircea R. Stan,et al. Bus-invert coding for low-power I/O , 1995, IEEE Trans. Very Large Scale Integr. Syst..

[47] Tao Zhang,et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[48] Alex Fit-Florea,et al. Precision and Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs , 2011 .

[49] Saibal Mukhopadhyay,et al. A programmable hardware accelerator for simulating dynamical systems , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[50] Thomas Toifl,et al. 28.5 A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET , 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).

[51] Karin Strauss,et al. A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.

[52] Subramanian S. Iyer,et al. A 14 nm 1.1 Mb Embedded DRAM Macro With 1 ns Access , 2016, IEEE Journal of Solid-State Circuits.

[53] Yiran Chen,et al. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).