Floating-point sparse matrix-vector multiply for FPGAs

Large, high density FPGAs with high local distributed memory bandwidth surpass the peak floating-point performance of high-end, general-purpose processors. Microprocessors do not deliver near their peak floating-point performance on efficient algorithms that use the Sparse Matrix-Vector Multiply (SMVM) kernel. In fact, it is not uncommon for microprocessors to yield only 10--20% of their peak floating-point performance when computing SMVM. We develop and analyze a scalable SMVM implementation on modern FPGAs and show that it can sustain high throughput, near peak, floating-point performance. For benchmark matrices from the Matrix Market Suite we project 1.5 double precision Gflops/FPGA for a single Virtex II 6000-4 and 12 double precision Gflops for 16 Virtex IIs (750Mflops/FPGA).

[1]  L. Trefethen,et al.  Numerical linear algebra , 1997 .

[2]  Frederic T. Chong,et al.  METRO: a router architecture for high-performance, short-haul routing networks , 1994, ISCA '94.

[3]  Keith D. Underwood,et al.  FPGAs vs. CPUs: trends in peak floating-point performance , 2004, FPGA '04.

[4]  Richard Vuduc,et al.  Automatic performance tuning of sparse matrix kernels , 2003 .

[5]  Dorit S. Hochba,et al.  Approximation Algorithms for NP-Hard Problems , 1997, SIGA.

[6]  R. Melham A systolic accelerator for the iterative solution of sparse linear systems , 1989 .

[7]  Andrew B. Kahng,et al.  Improved algorithms for hypergraph bipartitioning , 2000, ASP-DAC '00.

[8]  Jung Ho Ahn,et al.  Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[9]  John Wawrzynek,et al.  Stochastic, spatial routing for hypergraphs, trees, and meshes , 2003, FPGA '03.

[10]  Pavle Belanovic,et al.  A Library of Parameterized Floating-Point Modules and Their Use , 2002, FPL.

[11]  James Demmel,et al.  Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[12]  Roy L. Russo,et al.  On a Pin Versus Block Relationship For Partitions of Logic Graphs , 1971, IEEE Transactions on Computers.

[13]  Charles E. Leiserson,et al.  Optimizing Synchronous Circuitry by Retiming (Preliminary Version) , 1983 .

[14]  J. Shalf,et al.  Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[15]  Brad L. Hutchings,et al.  JHDL-an HDL for reconfigurable systems , 1998, Proceedings. IEEE Symposium on FPGAs for Custom Computing Machines (Cat. No.98TB100251).