Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers

This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient (PCG) algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can improve scalability on the CM-5 further by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than one with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation.
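To make the role of the inner products concrete, the following is a minimal serial sketch of PCG with a diagonal preconditioner, applied to the block-tridiagonal matrix of a five-point finite difference stencil on an n-by-n grid. It is an illustration under stated assumptions, not the paper's parallel formulation: the function names, the choice of stencil, and the stopping test are all hypothetical. The comments mark the two dot products per iteration that would become global reductions across processors in a parallel version.

```python
def laplacian_2d_matvec(x, n):
    """Return A @ x for the five-point stencil on an n-by-n grid (assumed test matrix)."""
    y = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            k = i * n + j
            v = 4.0 * x[k]
            if i > 0:
                v -= x[k - n]      # neighbor in the previous grid row
            if i < n - 1:
                v -= x[k + n]      # neighbor in the next grid row
            if j > 0:
                v -= x[k - 1]      # left neighbor
            if j < n - 1:
                v -= x[k + 1]      # right neighbor
            y[k] = v
    return y


def dot(u, v):
    """Vector inner product; a global reduction in a parallel setting."""
    return sum(a * b for a, b in zip(u, v))


def pcg_diagonal(b, n, tol=1e-10, max_iter=1000):
    """Solve A x = b with PCG and the diagonal (Jacobi) preconditioner M = diag(A)."""
    inv_diag = 1.0 / 4.0           # diag(A) is 4 everywhere for this stencil
    x = [0.0] * (n * n)
    r = list(b)                    # residual for the zero initial guess
    z = [inv_diag * ri for ri in r]
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    if b_norm == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        Ap = laplacian_2d_matvec(p, n)          # nearest-neighbor communication only
        alpha = rz / dot(p, Ap)                 # inner product 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        # Extra reduction used only for the stopping test in this sketch.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = [inv_diag * ri for ri in r]         # preconditioner solve, purely local
        rz_new = dot(r, z)                      # inner product 2
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    n = 8
    x_true = [1.0] * (n * n)
    b = laplacian_2d_matvec(x_true, n)
    x, iters = pcg_diagonal(b, n)
    print(iters, max(abs(a - t) for a, t in zip(x, x_true)))
```

In this sketch the matrix-vector product and the diagonal solve touch only local or neighboring grid points, whereas each call to `dot` combines values from the whole vector, which is why the inner products are the natural place for global communication cost to appear.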
