Migration of Vectorized Iterative Solvers to Distributed-Memory Architectures

Distributed-memory parallel processors (DMPPs) can deliver higher peak performance than vector supercomputers while promising a better cost-performance ratio. Programming them, however, is harder than programming traditional vector systems, especially for problems that require unstructured solution methods. One class of such applications, with large resource requirements, is the numerical solution of partial differential equations (PDEs) on nonuniformly refined three-dimensional finite element discretizations. Porting an application of this class from vector and shared-memory parallel machines to DMPPs involves fundamental algorithmic changes, such as grid decomposition, mapping, and coloring strategies. In addition, no standardized language interface is available to ease efficient parallelization and porting among DMPPs and between vector computers and DMPPs. This article describes how PILS, an existing package for the iterative solution of large unstructured sparse linear systems of equations on vector computers, was ported to DMPPs using the parallelizing Fortran compiler Oxygen. Two DMPPs, an Intel Paragon and a Fujitsu AP1000, were used to evaluate the performance of the generated parallel program quantitatively. The results indicate how an application should be designed to be portable among supercomputers of different architectures. Several language and architecture features are essential for such a porting process and drastically ease the parallelization of similar applications.
