Implementation of mixed precision in solving systems of linear equations on the Cell processor

This paper describes the design concepts behind implementations of mixed-precision linear algebra routines targeting the Cell processor. It describes in detail the implementation of code that solves a system of linear equations using Gaussian elimination in single precision, followed by iterative refinement of the solution to full double-precision accuracy. With this approach, the algorithm achieves close to an order of magnitude higher performance on the Cell processor than the standard double-precision algorithm. The code is effectively an implementation of the high-performance LINPACK benchmark, as it meets all of the requirements concerning the problem being solved and the numerical properties of the solution. Copyright © 2007 John Wiley & Sons, Ltd.
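The mixed-precision scheme summarized above can be illustrated with a minimal sketch. This is not the paper's Cell implementation: it is plain Python in which single precision is emulated by rounding each arithmetic result to a 32-bit float, and the function names (`f32`, `lu_factor_f32`, `lu_solve_f32`, `solve_mixed`), the stopping tolerance, and the iteration limit are all assumptions made for illustration. The factorization and triangular solves run in (emulated) single precision, while the residual and the solution update are computed in double precision.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting; every result is rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    piv = list(range(n))  # piv[i] = original index of the row now in position i
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular to single precision")
        lu[k], lu[p] = lu[p], lu[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, piv


def lu_solve_f32(lu, piv, b):
    """Solve with the single-precision factors (forward, then back substitution)."""
    n = len(lu)
    y = [f32(b[piv[i]]) for i in range(n)]
    for i in range(n):  # unit lower triangular solve
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # upper triangular solve
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def solve_mixed(a, b, tol=1e-14, max_iter=30):
    """Single-precision factorization plus double-precision iterative refinement.

    tol and max_iter are illustrative assumptions, not values from the paper.
    """
    n = len(a)
    lu, piv = lu_factor_f32(a)      # O(n^3) work, done once in single precision
    x = lu_solve_f32(lu, piv, b)    # initial single-precision solution
    b_norm = max(abs(v) for v in b)
    for _ in range(max_iter):
        # Residual r = b - A x, computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            break
        z = lu_solve_f32(lu, piv, r)            # correction from the cheap factors
        x = [x[i] + z[i] for i in range(n)]     # update accumulated in double
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.3], [1.0, 5.1, 0.7], [0.3, 0.7, 6.2]]
    x_true = [0.1, -0.2, 0.3]
    b = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]
    print(solve_mixed(A, b))
```

The design point this sketch tries to convey is that the cubic-cost factorization runs entirely in the fast precision, while each refinement step adds only quadratic-cost work in double precision. The scheme relies on the matrix being reasonably well conditioned relative to single precision; for badly conditioned systems the refinement may fail to converge.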
