Rapid in-memory matrix multiplication using associative processor

Memory hierarchy latency is one of the main obstacles preventing processors from achieving high performance. To eliminate the need to load and store large data sets, Resistive Associative Processors (ReAPs) have been proposed as a solution to the von Neumann bottleneck. In ReAPs, logic and memory structures are combined to allow in-memory computation. In this paper, we propose a new algorithm that computes matrix multiplication inside the memory, exploiting the benefits of ReAPs. The proposed approach is based on Cannon's algorithm and uses a series of rotations without duplicating the data. It runs in O(n), where n is the dimension of the matrix, and it also applies to a large class of row-by-column matrix-based applications. Experimental results show several orders of magnitude improvement in performance, and reductions in energy and area, compared with recent FPGA and CPU implementations.
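The paper's algorithm targets a resistive associative processor, but the rotation scheme it builds on is Cannon's algorithm, which can be sketched in ordinary software. The following is a minimal illustrative sketch (the function name `cannon_matmul` and the use of NumPy `roll` to stand in for inter-cell rotations are our assumptions, not the paper's implementation): each cell (i, j) of an n×n virtual grid holds one element of A and one of B; after an initial skew, n rounds of local multiply-accumulate plus unit rotations of A's rows and B's columns yield C = A·B in O(n) steps.

```python
import numpy as np

def cannon_matmul(A, B):
    """Illustrative sketch of Cannon's algorithm on an n x n virtual grid.

    Each virtual cell (i, j) holds one element of A and B. After an
    initial skew, n rounds of multiply-accumulate and unit rotations
    (rows of A left, columns of B up) produce C = A @ B in O(n) steps.
    """
    n = A.shape[0]
    A = A.astype(float).copy()
    B = B.astype(float).copy()
    # Initial alignment: shift row i of A left by i, column j of B up by j,
    # so every cell starts with a matching (A[i, k], B[k, j]) pair.
    for i in range(n):
        A[i, :] = np.roll(A[i, :], -i)
    for j in range(n):
        B[:, j] = np.roll(B[:, j], -j)
    C = np.zeros((n, n))
    for _ in range(n):
        C += A * B                  # every cell multiplies its local pair
        A = np.roll(A, -1, axis=1)  # rotate each row of A one step left
        B = np.roll(B, -1, axis=0)  # rotate each column of B one step up
    return C
```

Note the data are only rotated, never duplicated, which matches the property the abstract emphasizes; on the associative processor each of the n rounds is a constant-time parallel step, giving the O(n) bound.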
