The performance impact of data reuse in parallel dense Cholesky factorization

This paper explores performance issues for several prominent approaches to parallel dense Cholesky factorization. The primary focus is on issues that arise when blocking techniques are integrated into parallel factorization to improve data reuse in the memory hierarchy. We first consider panel-oriented approaches, in which sets of contiguous columns are manipulated as single units. These methods are natural extensions of the column-oriented methods that have been widely used previously, and on machines with memory hierarchies they significantly increase achieved performance over column-oriented methods. However, we find that panel-oriented methods do not expose enough concurrency for problems that one might reasonably expect to solve on moderately parallel machines, which significantly limits their performance. We then explore block-oriented approaches, in which square submatrices are manipulated instead of sets of columns. These methods greatly increase the amount of available concurrency, alleviating the problems encountered with panel-oriented methods. However, several issues, including scheduling choices and block-placement decisions, complicate their implementation. We discuss these issues and consider approaches that solve the resulting problems. The resulting block-oriented implementation achieves high processor utilization over a wide range of problem sizes.
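To make the block-oriented idea concrete, the following is a minimal sequential sketch (not the paper's implementation) of a right-looking blocked Cholesky factorization in NumPy. The function name `blocked_cholesky` and block size `b` are illustrative choices. Each step works on square b-by-b submatrices: factor the diagonal block, solve for the panel below it, then update the trailing submatrix. In a parallel setting, the many independent block updates of the trailing submatrix are the source of the extra concurrency that panel-oriented methods lack.

```python
import numpy as np

def blocked_cholesky(A, b=2):
    """Right-looking blocked Cholesky: returns lower-triangular L with
    A = L @ L.T. Operating on b-by-b submatrices lets each block be
    reused while it resides in fast memory (the data-reuse motivation)."""
    n = A.shape[0]
    L = np.tril(A).astype(float)  # work on a copy; only the lower triangle is used
    for k in range(0, n, b):
        kb = min(b, n - k)
        # 1) Factor the (already updated) diagonal block.
        L[k:k+kb, k:k+kb] = np.linalg.cholesky(L[k:k+kb, k:k+kb])
        if k + kb < n:
            Lkk = L[k:k+kb, k:k+kb]
            # 2) Panel solve: L21 = A21 @ inv(Lkk).T via a triangular system.
            L[k+kb:, k:k+kb] = np.linalg.solve(Lkk, L[k+kb:, k:k+kb].T).T
            # 3) Trailing update: A22 -= L21 @ L21.T. In a parallel code,
            # this rank-kb update decomposes into independent block tasks.
            S = L[k+kb:, k:k+kb]
            L[k+kb:, k+kb:] -= S @ S.T
    return np.tril(L)
```

A panel-oriented method corresponds to parallelizing only within steps 1 and 2 for tall column panels; the block-oriented view additionally partitions step 3 into a two-dimensional grid of submatrix updates, which is where most of the available concurrency lies.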
