Paper information - A Power Efficient Linear Equation Solver on a Multi-FPGA Accelerator

Abstract This paper explores a commercial multi-FPGA (field-programmable gate array) system as a high-performance accelerator and addresses the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data-flow and memory architectures that support arbitrary data types, block sizes, and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The accelerator system is pushed to its limits by implementing the problem for double-precision complex floating-point data. Detailed timing data are presented and augmented with data from a performance model proposed in this paper. The performance of the accelerator system is evaluated against that of a state-of-the-art low-power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power-efficiency metric (performance per watt): the FPGA system is about eleven times more power efficient than the compute node of the cluster.
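As a point of reference for the problem the abstract names, the following is a minimal software sketch of forward and back substitution on an already LU-decomposed system with complex double-precision entries. It is not the paper's block-based FPGA architecture. The function names, the list-of-rows matrix layout, and the small 2x2 example are assumptions made here purely for illustration.

```python
# Illustrative sketch only: plain (non-blocked) forward and back substitution
# for A x = b with A = L U already factored. All names here are hypothetical
# and are not taken from the paper.

def forward_substitution(L, b):
    """Solve L y = b, where L is lower triangular (list of rows)."""
    n = len(L)
    y = [0j] * n
    for i in range(n):
        # Subtract the contributions of the already-computed unknowns.
        acc = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - acc) / L[i][i]
    return y

def back_substitution(U, y):
    """Solve U x = y, where U is upper triangular (list of rows)."""
    n = len(U)
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        acc = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - acc) / U[i][i]
    return x

def solve_lu(L, U, b):
    """Solve (L U) x = b by forward then back substitution."""
    return back_substitution(U, forward_substitution(L, b))

def matvec(M, v):
    """Dense matrix-vector product, used only to build the example."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Small complex example: build b = L (U x_true), then recover x_true.
L = [[1, 0],
     [2 + 1j, 1]]
U = [[2, 1j],
     [0, 1 - 1j]]
x_true = [1 + 1j, 2 + 0j]
b = matvec(L, matvec(U, x_true))
x = solve_lu(L, U, b)
```

Each unknown depends on all previously computed ones, so the two loops are inherently sequential along the diagonal. A block-based formulation, as the abstract mentions, applies the same recurrence to sub-matrices rather than scalars.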
