A data locality-aware design framework for reconfigurable sparse matrix-vector multiplication kernel

Sparse matrix-vector multiplication (SpMV) is an important computational kernel in many applications. To improve its performance, software libraries dedicated to SpMV computation have been introduced, e.g., the MKL library for CPUs and the cuSPARSE library for GPUs. However, the computational throughput of these libraries is far below the peak floating-point performance offered by the hardware platforms, because the efficiency of the SpMV kernel is severely constrained by limited memory bandwidth and irregular data access patterns. In this work, we propose a data locality-aware design framework for FPGA-based SpMV acceleration. We first incorporate the hardware constraints into sparse matrix compression at the software level to regularize memory allocation and accesses. Moreover, a distributed architecture composed of processing elements is developed to improve computation parallelism. We implement the reconfigurable SpMV kernel on a Convey HC-2ex and evaluate it using the University of Florida sparse matrix collection. The experiments demonstrate an average computational efficiency of 48.2%, which is substantially higher than that of the CPU and GPU implementations. Our FPGA-based kernel has a runtime comparable to that of the GPU and achieves a 2.1× runtime reduction relative to the CPU. Moreover, our design obtains substantial savings in energy consumption: 9.3× and 5.6× better than the implementations on CPU and GPU, respectively.
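To make the irregular access pattern mentioned above concrete, the following is a minimal sketch of SpMV over a matrix stored in compressed sparse row (CSR) form. CSR is assumed here purely for illustration; the abstract does not say which compression format the framework uses, and the function name `spmv_csr` and the small example matrix are hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative only)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # The nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is indexed through col_idx, so reads of x jump around in
            # memory depending on the sparsity pattern of the matrix.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x4 example matrix:
# A = [[4, 0, 0, 1],
#      [0, 0, 2, 0],
#      [0, 3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 3, 2, 1, 3]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0, 4.0]

y = spmv_csr(values, col_idx, row_ptr, x)
```

Each nonzero contributes only one multiply and one add but requires loading a value, a column index, and an entry of `x` at a data-dependent address, which is one way to see why throughput tends to be bound by memory bandwidth rather than by floating-point capacity.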
