Parallel LU factorization of sparse matrices on FPGA‐based configurable computing engines

Configurable computing, in which hardware resources are configured to match specific hardware designs, has recently been shown to significantly improve performance for a wide range of computation‐intensive applications. With steady advances in silicon technology, as predicted by Moore's Law, Field‐Programmable Gate Array (FPGA) technologies have enabled the implementation of System‐on‐a‐Programmable‐Chip (SOPC or SOC) computing platforms, which, in turn, have given a significant boost to the field of configurable computing. As a result, various specialized parallel machines can now be implemented on a single silicon chip. In this paper, we describe the design and implementation of a parallel machine on an SOPC development board, using multiple instances of a soft IP configurable processor; we use this machine for LU factorization. LU factorization is widely used in engineering and science to solve large systems of linear equations efficiently. Our implementation facilitates the efficient solution of linear equations at a cost much lower than that of supercomputers and networks of workstations. We present the intricacies of our FPGA‐based design, along with the tradeoff choices made, for the purpose of illustration. Performance results demonstrate the viability of our approach. Copyright © 2004 John Wiley & Sons, Ltd.
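For readers unfamiliar with the technique, the following minimal sketch shows how LU factorization is used to solve a linear system A x = b. It is a plain sequential, dense Doolittle factorization without pivoting, written for illustration only; it is not the parallel sparse algorithm or the FPGA implementation described in the paper, and the function names `lu_factor` and `lu_solve` are hypothetical.

```python
def lu_factor(a):
    """Factor a square matrix A (list of rows) as A = L * U.

    L is unit lower triangular and U is upper triangular.
    Assumption: no pivoting is performed, so every pivot must be nonzero.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(x) for x in row] for row in a]  # working copy of A
    for k in range(n):
        if upper[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            m = upper[i][k] / upper[k][k]  # elimination multiplier
            lower[i][k] = m
            for j in range(k, n):
                upper[i][j] -= m * upper[k][j]
    return lower, upper


def lu_solve(lower, upper, b):
    """Solve L * U * x = b by forward and then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L * y = b
        y[i] = b[i] - sum(lower[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: U * x = y
        s = sum(upper[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / upper[i][i]
    return x


if __name__ == "__main__":
    A = [[4.0, 3.0], [6.0, 3.0]]
    L, U = lu_factor(A)
    print(lu_solve(L, U, [10.0, 12.0]))
```

Once the factors are available, each additional right-hand side costs only the two triangular solves, which is one reason the factorization is attractive when many systems share the same matrix.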