论文信息 - Floating-point sparse matrix-vector multiply for FPGAs

Floating-point sparse matrix-vector multiply for FPGAs

Large, high density FPGAs with high local distributed memory bandwidth surpass the peak floating-point performance of high-end, general-purpose processors. Microprocessors do not deliver near their peak floating-point performance on efficient algorithms that use the Sparse Matrix-Vector Multiply (SMVM) kernel. In fact, it is not uncommon for microprocessors to yield only 10--20% of their peak floating-point performance when computing SMVM. We develop and analyze a scalable SMVM implementation on modern FPGAs and show that it can sustain high throughput, near peak, floating-point performance. For benchmark matrices from the Matrix Market Suite we project 1.5 double precision Gflops/FPGA for a single Virtex II 6000-4 and 12 double precision Gflops for 16 Virtex IIs (750Mflops/FPGA).

André DeHon | Michael DeLorimier

[1] L. Trefethen,et al. Numerical linear algebra , 1997 .

[2] Frederic T. Chong,et al. METRO: a router architecture for high-performance, short-haul routing networks , 1994, ISCA '94.

[3] Keith D. Underwood,et al. FPGAs vs. CPUs: trends in peak floating-point performance , 2004, FPGA '04.

[4] Richard Vuduc,et al. Automatic performance tuning of sparse matrix kernels , 2003 .

[5] Dorit S. Hochba,et al. Approximation Algorithms for NP-Hard Problems , 1997, SIGA.

[6] R. Melham. A systolic accelerator for the iterative solution of sparse linear systems , 1989 .

[7] Andrew B. Kahng,et al. Improved algorithms for hypergraph bipartitioning , 2000, ASP-DAC '00.

[8] Jung Ho Ahn,et al. Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[9] John Wawrzynek,et al. Stochastic, spatial routing for hypergraphs, trees, and meshes , 2003, FPGA '03.

[10] Pavle Belanovic,et al. A Library of Parameterized Floating-Point Modules and Their Use , 2002, FPL.

[11] James Demmel,et al. Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[12] Roy L. Russo,et al. On a Pin Versus Block Relationship For Partitions of Logic Graphs , 1971, IEEE Transactions on Computers.

[13] Charles E. Leiserson,et al. Optimizing Synchronous Circuitry by Retiming (Preliminary Version) , 1983 .

[14] J. Shalf,et al. Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[15] Brad L. Hutchings,et al. JHDL-an HDL for reconfigurable systems , 1998, Proceedings. IEEE Symposium on FPGAs for Custom Computing Machines (Cat. No.98TB100251).