Towards a Universal FPGA Matrix-Vector Multiplication Architecture

We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings, ranging from dense to several sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly decoding of various formats such as Dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design are demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16GB of DRAM) and (2) the ML605 (Virtex-6 LX240T with 2GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our approach uses a compressed representation that reduces the overall bandwidth while achieving efficiency comparable to state-of-the-art approaches.
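For readers unfamiliar with the formats named above, the following is a minimal software sketch of two of them, COO and CSR, and of a CSR matrix-vector product. It is an illustration only and not the hardware decoder described here: the function names `coo_to_csr` and `csr_matvec` and the list-based layout are assumptions made for this example, and the CVBV encoding is not shown because its details are not given in this abstract.

```python
from typing import List, Sequence, Tuple


def coo_to_csr(
    n_rows: int, entries: Sequence[Tuple[int, int, float]]
) -> Tuple[List[int], List[int], List[float]]:
    """Convert COO triples (row, col, value) into CSR arrays.

    Hypothetical helper for illustration; returns (row_ptr, col_idx, values).
    """
    # Sort by row, then column, so each row's nonzeros are contiguous.
    ordered = sorted(entries, key=lambda e: (e[0], e[1]))

    # Count nonzeros per row, then prefix-sum into row start offsets.
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in ordered:
        row_ptr[r + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]

    col_idx = [c for _, c, _ in ordered]
    values = [v for _, _, v in ordered]
    return row_ptr, col_idx, values


def csr_matvec(
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    values: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of row r occupy positions row_ptr[r] .. row_ptr[r+1]-1.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Example: the 3x3 matrix [[1, 0, 2], [0, 0, 0], [0, 3, 0]] given as COO triples.
coo = [(2, 1, 3.0), (0, 0, 1.0), (0, 2, 2.0)]
row_ptr, col_idx, values = coo_to_csr(3, coo)
y = csr_matvec(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

In this sketch CSR stores one column index per nonzero plus one row offset per row, which is the index metadata that a more compact encoding would aim to shrink.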
