Paper Information - Towards a Universal FPGA Matrix-Vector Multiplication Architecture

Towards a Universal FPGA Matrix-Vector Multiplication Architecture

We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings, ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly decoding of various formats such as Dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design are demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16GB of DRAM) and (2) the ML605 (Virtex-6 LX240T with 2GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our approach using a compressed representation reduces the overall bandwidth while also achieving comparable efficiency relative to state-of-the-art approaches.
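For readers unfamiliar with the CSR baseline that the abstract compares against, the following is a minimal software sketch of CSR encoding and matrix-vector multiplication. It is only an illustration of the general format, not the paper's hardware decoder or its CVBV representation; the function names `dense_to_csr` and `csr_matvec` and the example matrix are assumptions introduced here.

```python
# Illustrative sketch only: a plain-Python CSR encoder and matrix-vector
# product. Names and data are hypothetical, not taken from the paper.

def dense_to_csr(matrix):
    """Encode a dense row-major matrix as CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)      # nonzero value
                col_idx.append(j)     # its column index
        row_ptr.append(len(values))   # end offset of this row's nonzeros
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x where A is given in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        # Only the nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


A = [
    [4, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 3, 0, 0],
]
x = [1, 2, 3, 4]

values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, x)
print(values, col_idx, row_ptr)
print(y)
```

In this sketch each nonzero carries one value and one explicit column index, plus one row pointer per row. That index metadata is the kind of storage and bandwidth overhead a more compact encoding would aim to reduce.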
