Paper Information - A Fine-grained Pipelined Implementation of the LINPACK Benchmark on FPGAs

A Fine-grained Pipelined Implementation of the LINPACK Benchmark on FPGAs

Previous work has projected that the peak performance of FPGAs can exceed that of general-purpose processors. However, no prior work has actually compared FPGA and CPU performance using a standard benchmark such as the LINPACK benchmark. We propose and implement an FPGA-based hardware design of the LINPACK benchmark, whose key step is LU decomposition with pivoting. We introduce a fine-grained pipelined LU decomposition algorithm that enables optimum performance by exploiting fine-grained pipeline parallelism. The core component of our hardware design is a scalable linear array of processing elements (PEs) that implements this algorithm. To the best of our knowledge, this is the first reported FPGA-based pipelined implementation of LU decomposition with pivoting. A total of 19 PEs can be integrated into an Altera Stratix II EP2S130F1020C5 on our self-designed development board. Experimental results show that a speedup of up to 6.14 can be achieved relative to a Pentium 4 processor for the LINPACK benchmark.
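
For readers unfamiliar with the key step named in the abstract, the following is a minimal software sketch of LU decomposition with partial (row) pivoting, followed by the triangular solves that LINPACK-style benchmarks perform. It is a plain sequential reference, not the paper's pipelined PE array; the function names `lu_decompose` and `lu_solve` and the choice of partial pivoting are assumptions made for illustration.

```python
def lu_decompose(a):
    """Illustrative LU factorization with partial (row) pivoting.

    Returns (lu, perm): lu stores L below the diagonal (unit diagonal
    implied) and U on and above it; lu row i came from original row perm[i].
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy of the input
    perm = list(range(n))
    for k in range(n):
        # Pivot search: row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Scale column k, then update the trailing submatrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b given the factors from lu_decompose."""
    n = len(lu)
    # Forward substitution on the permuted right-hand side: L y = P b.
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    # Back substitution: U x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x
```

The innermost update loop over `i` and `j` dominates the cost (on the order of n^3 operations), which is presumably the kind of work a linear array of PEs would be designed to spread out; how the paper actually maps it onto hardware is not stated in the abstract.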
