Optimizing memory bandwidth use and performance for matrix-vector multiplication in iterative methods

Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms for solving these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over General-Purpose Processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelizing the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the available memory bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either been able to solve large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieved high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This article proposes hardware designs that take advantage of symmetric and banded matrix structure, as well as methods to optimize RAM use, in order both to increase performance and to retain that performance for larger-order matrices.
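
To illustrate why symmetric and banded structure reduces memory traffic, the following is a minimal software sketch, not the hardware design proposed in the article. It assumes a hypothetical storage layout in which `band[d][i]` holds the entry A[i][i+d] of the main diagonal and the super-diagonals only; the function name `sym_band_matvec` is likewise an illustrative choice. Each stored off-diagonal element is read once and used for two multiply-accumulates, so roughly half as many matrix elements need to be fetched as in a layout that stores the full band.

```python
def sym_band_matvec(band, x):
    """Return y = A @ x for a symmetric banded matrix A.

    Assumed (hypothetical) layout: band[d][i] == A[i][i + d]
    for d = 0..k and i = 0..n - d - 1, where k is the half-bandwidth.
    Only the main diagonal and the super-diagonals are stored.
    """
    n = len(x)
    y = [0.0] * n
    for d, diag in enumerate(band):
        for i, a in enumerate(diag):
            j = i + d
            # Contribution of the stored element A[i][j].
            y[i] += a * x[j]
            # By symmetry A[j][i] == A[i][j]; reuse the same value
            # instead of reading it again (skip the main diagonal).
            if d > 0:
                y[j] += a * x[i]
    return y


# Tridiagonal example: A = [[4, 1, 0], [1, 4, 1], [0, 1, 4]]
band = [[4, 4, 4], [1, 1]]
print(sym_band_matvec(band, [1, 2, 3]))  # prints [6.0, 12.0, 14.0]
```

In this sketch the storage for an order-n matrix with half-bandwidth k is about n(k + 1) values rather than n², which is the kind of saving that makes it plausible to keep the matrix in on-chip buffers across iterations.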

[1] Louis O. Hertzberger, et al. Time complexity of a parallel conjugate gradient solver for light scattering simulations: theory and SPMD implementation, 1992.

[2] Viktor K. Prasanna, et al. High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs, 2007, IEEE Transactions on Parallel and Distributed Systems.

[3] Gene H. Golub, et al. Matrix computations (3rd ed.), 1996.

[4] A. Mercer. Numerical Solution of Ordinary and Partial Differential Equations, 1963.

[5] George A. Constantinides, et al. An FPGA-based implementation of the MINRES algorithm, 2008, 2008 International Conference on Field Programmable Logic and Applications.

[6] Granville Sewell, et al. Initial Value Ordinary Differential Equations, 1988.

[7] Robert H. Halstead, et al. Matrix Computations, 2011, Encyclopedia of Parallel Computing.

[8] Richard Barrett, et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 1994, Other Titles in Applied Mathematics.

[9] Viktor K. Prasanna, et al. Sparse Matrix-Vector multiplication on FPGAs, 2005, FPGA '05.

[10] George A. Constantinides, et al. Optimising Memory Bandwidth Use for Matrix-Vector Multiplication in Iterative Methods, 2010, ARC.

[11] Viktor K. Prasanna, et al. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer, 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[12] Wayne L. Winston. Introduction to Mathematical Programming: Applications and Algorithms, 1990.

[13] Warren J. Gross, et al. Sparse Matrix-Vector Multiplication for Finite Element Method Matrices on FPGAs, 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[14] Wei Zhang, et al. Portable and scalable FPGA-based acceleration of a direct linear system solver, 2008, 2008 International Conference on Field-Programmable Technology.

[15] Michael T. Heath, et al. Scientific Computing, 2018.

[16] André DeHon, et al. Floating-point sparse matrix-vector multiply for FPGAs, 2005, FPGA '05.

[17] Eric C. Kerrigan, et al. A floating-point solver for band structured linear equations, 2008, 2008 International Conference on Field-Programmable Technology.

[18] George A. Constantinides, et al. A High Throughput FPGA-based Floating Point Conjugate Gradient Implementation, 2008, ARC.

[19] Viktor K. Prasanna, et al. High-Performance and Parameterized Matrix Factorization on FPGAs, 2006, 2006 International Conference on Field Programmable Logic and Applications.

[20] Jack Poulson, et al. Scientific computing, 2013, XRDS.