A strategy for mapping unstructured mesh computational mechanics programs onto distributed memory parallel architectures

The motivation of this thesis was to develop strategies that enable unstructured mesh based computational mechanics codes to exploit the computational advantages offered by distributed memory parallel processors. Strategies that successfully map structured mesh codes onto parallel machines have been developed over the previous decade and used to build a toolkit for automating the parallelisation process. Extending this toolkit to unstructured mesh codes requires new strategies. This thesis examines parallelisation by geometric domain decomposition using the single program multiple data (SPMD) programming paradigm with explicit message passing. This technique splits (decomposes) the problem definition into P parts that may be distributed over P processors in a parallel machine. Each processor runs the same program and operates only on its own part of the problem. Messages passed between the processors exchange the data needed to maintain consistency with the original algorithm. The strategies developed to parallelise unstructured mesh codes should meet a number of requirements: the algorithms are faithfully reproduced in parallel; the code is largely unaltered in the parallel version; the parallel efficiency is maximised; the techniques scale to highly parallel systems; and the parallelisation process can be automated. Techniques and strategies that meet these requirements are developed and tested in this dissertation using a state-of-the-art integrated computational fluid dynamics and solid mechanics code. The results presented demonstrate the importance of the problem partition in defining the inter-processor communication and hence the parallel performance. The classical measure of partition quality, based on the number of cut edges in the mesh partition, can be inadequate for real parallel machines.
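
As a minimal sketch of what the cut-edge measure counts, the following Python fragment takes a small mesh graph and an assignment of nodes to parts, then lists the cut edges and the off-processor nodes each part would need to receive. The six-node graph, the `part` mapping and the function names `cut_edges` and `halo_nodes` are hypothetical illustrations, not taken from the thesis.

```python
from collections import defaultdict

def cut_edges(edges, part):
    """Return the mesh edges whose two end nodes lie in different parts."""
    return [(u, v) for u, v in edges if part[u] != part[v]]

def halo_nodes(edges, part):
    """For each part, collect the nodes owned by other parts that it needs."""
    halo = defaultdict(set)
    for u, v in edges:
        if part[u] != part[v]:
            halo[part[u]].add(v)  # owner of u needs a copy of v
            halo[part[v]].add(u)  # owner of v needs a copy of u
    return dict(halo)

# Hypothetical mesh graph: a 2 x 3 grid of nodes, numbered row by row.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
# Hypothetical decomposition into P = 2 parts (node -> owning processor).
part = {0: 0, 1: 0, 3: 0, 4: 0, 2: 1, 5: 1}

cuts = cut_edges(edges, part)
halo = halo_nodes(edges, part)
print(len(cuts), halo)
```

The number of cut edges is only a proxy for communication volume; it says nothing about how many separate messages are sent or how far they travel in the machine.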
Taking the topology of the parallel machine into account in the mesh partition is demonstrated to be a more significant factor in the achieved parallel efficiency than the number of cut edges. It is shown to be advantageous to allow an increase in the volume of communication in order to achieve an efficient mapping dominated by localised communications. The limit on parallel performance imposed by communication startup latency is clearly revealed, together with strategies to minimise its effect. The generic application of the techniques to other unstructured mesh codes is discussed in the context of automating the parallelisation process. Automation of parallelisation based on the developed strategies is presented as possible through the use of run-time inspector loops that accurately determine the dependencies defining the necessary inter-processor communication.
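
The trade-off between communication volume and locality can be illustrated with a simple linear cost model, in which every message pays a fixed startup latency, a penalty for each extra hop through the machine, and a per-item transfer cost. This is a generic sketch rather than the model used in the thesis; the function `comm_time` and all of the timing constants and message sizes below are assumed values chosen only for illustration.

```python
def comm_time(messages, latency, per_item, hop_penalty=0.0):
    """Estimate one processor's communication time per exchange.

    messages: list of (items, hops) pairs, one per message sent.
    Each message costs a startup latency, a penalty for every hop
    beyond the nearest neighbour, and a per-item transfer cost.
    """
    return sum(latency + hop_penalty * (hops - 1) + per_item * items
               for items, hops in messages)

# Assumed machine parameters (seconds); not measured values.
LATENCY = 100e-6   # startup cost per message
PER_ITEM = 1e-6    # transfer cost per data item
HOP = 50e-6        # extra cost per additional hop

# Partition A: smaller volume (60 items) in four messages, two to distant processors.
partition_a = [(15, 1), (15, 1), (15, 3), (15, 3)]
# Partition B: larger volume (90 items) in two nearest-neighbour messages.
partition_b = [(45, 1), (45, 1)]

t_a = comm_time(partition_a, LATENCY, PER_ITEM, HOP)
t_b = comm_time(partition_b, LATENCY, PER_ITEM, HOP)
print(f"A: {t_a * 1e6:.0f} us, B: {t_b * 1e6:.0f} us")
```

Under these assumed parameters the partition with the larger volume is the cheaper one, because it sends fewer messages and keeps them local.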
