Rapid in-memory matrix multiplication using associative processor

Memory hierarchy latency is one of the main obstacles preventing processors from achieving high performance. To eliminate the need to load and store large data sets, Resistive Associative Processors (ReAPs) have been proposed as a solution to the von Neumann bottleneck. In ReAPs, logic and memory structures are combined to allow in-memory computation. In this paper, we propose a new algorithm that computes matrix multiplication inside the memory, exploiting the benefits of ReAPs. The proposed approach is based on Cannon's algorithm and uses a series of rotations without duplicating the data. It runs in O(n), where n is the dimension of the matrix, and it also applies to a large class of row-by-column matrix-based applications. Experimental results show several orders of magnitude improvement in performance, and reductions in energy and area, compared with recent FPGA and CPU implementations.
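The paper's algorithm targets a resistive associative processor, but the rotation scheme it builds on is Cannon's algorithm, which can be sketched in ordinary software. The following is a minimal illustrative sketch (the function name `cannon_matmul` and the use of NumPy `roll` to stand in for inter-cell rotations are our assumptions, not the paper's implementation): each cell (i, j) of an n×n virtual grid holds one element of A and one of B; after an initial skew, n rounds of local multiply-accumulate plus unit rotations of A's rows and B's columns yield C = A·B in O(n) steps.

```python
import numpy as np

def cannon_matmul(A, B):
    """Illustrative sketch of Cannon's algorithm on an n x n virtual grid.

    Each virtual cell (i, j) holds one element of A and B. After an
    initial skew, n rounds of multiply-accumulate and unit rotations
    (rows of A left, columns of B up) produce C = A @ B in O(n) steps.
    """
    n = A.shape[0]
    A = A.astype(float).copy()
    B = B.astype(float).copy()
    # Initial alignment: shift row i of A left by i, column j of B up by j,
    # so every cell starts with a matching (A[i, k], B[k, j]) pair.
    for i in range(n):
        A[i, :] = np.roll(A[i, :], -i)
    for j in range(n):
        B[:, j] = np.roll(B[:, j], -j)
    C = np.zeros((n, n))
    for _ in range(n):
        C += A * B                  # every cell multiplies its local pair
        A = np.roll(A, -1, axis=1)  # rotate each row of A one step left
        B = np.roll(B, -1, axis=0)  # rotate each column of B one step up
    return C
```

Note the data are only rotated, never duplicated, which matches the property the abstract emphasizes; on the associative processor each of the n rounds is a constant-time parallel step, giving the O(n) bound.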
