Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization

The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time, it presents new challenges for the development of numerical algorithms. Two stand out: one is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization across the short-vector SIMD cores. The first challenge is addressed by applying the well-known technique of iterative refinement to the solution of a dense symmetric positive definite system of linear equations. The result is a mixed-precision algorithm that delivers double precision accuracy while performing the bulk of the work in single precision. The main contribution of this paper lies in addressing the second challenge through successful thread-level parallelization, exploiting fine-grained task granularity and lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90 percent of peak floating point performance, while the implementation of the memory intensive sections reaches within 90 percent of peak memory bandwidth. On a single CELL processor, the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.
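
The mixed-precision idea described above can be illustrated with a minimal sketch. It is not the CELL implementation and says nothing about parallelization: it is plain, unoptimized Python in which single precision is emulated by rounding every intermediate result to 32 bits with `struct`. All function names (`f32`, `cholesky_f32`, `solve_f32`, `residual_f64`, `refine`) and the parameters `max_iter` and `tol` are assumptions made for this illustration. The Cholesky factorization and the triangular solves run in (emulated) single precision, while the residual and the solution update are computed in double precision.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_f32(A):
    """Lower-triangular Cholesky factor of A, with every operation rounded to single."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = f32(A[j][j])
        for k in range(j):
            s = f32(s - f32(L[j][k] * L[j][k]))
        L[j][j] = f32(math.sqrt(s))  # raises ValueError if A is not positive definite
        for i in range(j + 1, n):
            s = f32(A[i][j])
            for k in range(j):
                s = f32(s - f32(L[i][k] * L[j][k]))
            L[i][j] = f32(s / L[j][j])
    return L


def solve_f32(L, b):
    """Solve (L L^T) x = b by forward and back substitution in single precision."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        s = f32(b[i])
        for k in range(i):
            s = f32(s - f32(L[i][k] * y[k]))
        y[i] = f32(s / L[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        s = y[i]
        for k in range(i + 1, n):
            s = f32(s - f32(L[k][i] * x[k]))
        x[i] = f32(s / L[i][i])
    return x


def residual_f64(A, x, b):
    """Residual r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, max_iter=10, tol=1e-14):
    """Mixed-precision solve: factor once in single, refine the solution in double."""
    L = cholesky_f32(A)      # the O(n^3) work, done in single precision
    x = solve_f32(L, b)      # initial single-precision solution
    b_norm = max(abs(bi) for bi in b)
    for _ in range(max_iter):
        r = residual_f64(A, x, b)                 # double precision
        if max(abs(ri) for ri in r) <= tol * b_norm:
            break
        z = solve_f32(L, r)                       # correction, single precision
        x = [xi + zi for xi, zi in zip(x, z)]     # update, double precision
    return x
```

For example, `refine([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]], [1.0, 2.0, 3.0])` factors the matrix once and reuses the factor in every refinement step, so each iteration costs only O(n^2) operations on top of the single O(n^3) factorization. The approach relies on the matrix being reasonably well conditioned in single precision; otherwise the refinement may fail to converge.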
