Mapping unstructured mesh codes onto local memory parallel architectures

Initial work on mapping CFD codes onto parallel systems focused upon software which employed structured meshes. Increasingly, however, large-scale CFD codes are being based upon unstructured meshes. One of the key problems when implementing such large-scale unstructured problems on a distributed memory machine is how to partition the underlying computational domain efficiently. It is important that all processors are kept busy for as large a proportion of the time as possible and that the amount, level and frequency of communication are kept to a minimum. Previously proposed techniques for solving this mapping problem separate the solution into two distinct phases: the first partitions the computational domain into cohesive sub-regions, and the second embeds these sub-regions onto the processors. However, it has been shown that performing these two operations in isolation can lead to poor mappings and to communication times that are far from optimal. In this thesis we develop a technique which takes account of the processor topology whilst simultaneously identifying the cohesive sub-regions. Our approach is based on an unstructured mesh decomposition method originally developed by Sadayappan et al [SER90] for a hypercube. This method forms the basis of a technique that enables decomposition onto an arbitrary number of processors on a specified processor network topology. Whilst partitioning the mesh, the optimisation method takes the processor topology into account by minimising the total interprocessor communication. The drawback of this technique is that it is not suitable for very large meshes, since the calculations often require prodigious amounts of computing power. The problem can be overcome by creating clusters of the original elements and using these to create a reduced network which is homomorphic to the original mesh.
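To illustrate why the processor topology matters, here is a minimal Python sketch of a topology-aware communication cost. It assumes, for illustration only, that the processor network is given as an adjacency dict, that the mesh is a list of weighted element-adjacency edges, and that a mapping's cost is the sum over mesh edges of edge weight times the hop distance between the processors holding the two elements. The names `hop_distances` and `mapping_cost` are hypothetical and are not taken from the thesis.

```python
from collections import deque

def hop_distances(topology):
    """Hop counts between every pair of processors in a network given as an
    adjacency dict {processor: [neighbours]}, via breadth-first search.
    Assumes the processor network is connected."""
    dist = {}
    for src in topology:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            p = queue.popleft()
            for q in topology[p]:
                if q not in seen:
                    seen[q] = seen[p] + 1
                    queue.append(q)
        dist[src] = seen
    return dist

def mapping_cost(mesh_edges, assignment, dist):
    """Total interprocessor communication of an element-to-processor assignment:
    each mesh edge (u, v, w) costs w times the hop distance between the
    processors holding u and v (zero when both sit on the same processor)."""
    return sum(w * dist[assignment[u]][assignment[v]] for u, v, w in mesh_edges)

# Hypothetical example: a 4-processor ring and a 4-element chain of mesh cells.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
mesh = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1)]
dist = hop_distances(ring)
print(mapping_cost(mesh, {"a": 0, "b": 1, "c": 2, "d": 3}, dist))  # 3
print(mapping_cost(mesh, {"a": 0, "b": 2, "c": 1, "d": 3}, dist))  # 5
```

Both assignments are equally balanced and cut the same three mesh edges, so a partitioner that only counts cut edges cannot tell them apart. Once the ring topology is included, however, the first costs 3 and the second 5. This is the kind of difference that is lost when partitioning and embedding are performed in isolation.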
The technique can then be applied to this image network with comparative ease. The clusters are created using an efficient graph bisection method. The coarseness of the reduced mesh inevitably leads to a degradation of the solution. However, it is possible to refine the resultant partition to recapture some of the richness of the original mesh and hence achieve reasonable partitions. One of the issues to be addressed is the level of granularity that gives the best balance between computational efficiency and optimality of the solution, and some progress has been made towards resolving this important issue. In this thesis, we show how the above technique can be used effectively in large-scale computations. Results include tests of the technique on large-scale meshes for complex flow domains.
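The clustering step can be sketched as a graph contraction. The Python sketch below assumes the clusters have already been produced by some bisection method. The abstract does not say which one, so none is shown. It uses the hypothetical helper names `contract` and `project`. Each cluster becomes one vertex of the reduced network, weighted by its element count. Edges between clusters are merged and their weights summed, while edges inside a cluster vanish. This loss of detail is what the later refinement on the original mesh tries to recover.

```python
from collections import defaultdict

def contract(mesh_edges, cluster_of):
    """Build the reduced network from a clustering of the mesh elements.
    Returns (vertex_weight, edge_weight): vertex_weight[c] is the number of
    elements in cluster c; edge_weight[(c1, c2)] with c1 < c2 is the summed
    weight of the mesh edges joining the two clusters. Edges inside a single
    cluster are dropped, which is where the coarse network loses detail."""
    vertex_weight = defaultdict(int)
    for c in cluster_of.values():
        vertex_weight[c] += 1
    edge_weight = defaultdict(int)
    for u, v, w in mesh_edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            edge_weight[(min(cu, cv), max(cu, cv))] += w
    return dict(vertex_weight), dict(edge_weight)

def project(cluster_of, cluster_assignment):
    """Carry a processor assignment of the clusters back to the original
    elements: every element goes to the processor of its cluster."""
    return {e: cluster_assignment[c] for e, c in cluster_of.items()}

# Hypothetical example: four elements grouped into two clusters.
mesh = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("a", "c", 1)]
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
vertex_weight, edge_weight = contract(mesh, cluster_of)
print(vertex_weight, edge_weight)  # {0: 2, 1: 2} {(0, 1): 2}
print(project(cluster_of, {0: 3, 1: 5}))  # {'a': 3, 'b': 3, 'c': 5, 'd': 5}
```

All elements of a cluster land on the same processor, so the dropped internal edges carry no interprocessor traffic. Under an edge-weight-times-distance cost, the projected assignment therefore costs exactly what the cluster assignment costs on the reduced network. Any further gain has to come from refinement that moves individual elements.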

[1] D. R. Fulkerson, et al. Flows in Networks, 1964.

[2] G. G. Alway, et al. An algorithm for reducing the bandwidth of a matrix of symmetrical configuration, 1965, Comput. J.

[3] Richard Rosen. Matrix bandwidth minimization, 1968, ACM National Conference.

[4] E. Cuthill, et al. Reducing the bandwidth of sparse symmetric matrices, 1969, ACM '69.

[5] Robin J. Wilson. Introduction to Graph Theory, 1974.

[6] Richard M. Karp, et al. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems, 1972, Combinatorial Optimization.

[7] M. Fiedler. Algebraic connectivity of graphs, 1973.

[8] Alex Pothen, et al. Partitioning Sparse Matrices with Eigenvectors of Graphs, 1990.

[9] J. A. Bondy, et al. Graph Theory with Applications, 1978.

[10] Harold S. Stone, et al. Multiprocessor Scheduling with the Aid of Network Flow Algorithms, 1977, IEEE Transactions on Software Engineering.

[11] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[12] Shahid H. Bokhari, et al. On the Mapping Problem, 1981, IEEE Transactions on Computers.

[13] R. M. Mattheyses, et al. A Linear-Time Heuristic for Improving Network Partitions, 1982, 19th Design Automation Conference.

[14] S. Micali, et al. Priority queues with variable priority and an O(EV log V) algorithm for finding a maximal weighted matching in general graphs, 1982, FOCS 1982.

[15] B. R. Baliga, et al. A Control Volume Finite-Element Method for Two-Dimensional Fluid Flow and Heat Transfer, 1983.

[16] C. D. Gelatt, et al. Optimization by Simulated Annealing, 1983, Science.

[17] C. Rhie, et al. Numerical Study of the Turbulent Flow Past an Airfoil with Trailing Edge Separation, 1983.

[18] Chien-Chung Shen, et al. A Graph Matching Approach to Optimal Task Assignment in Distributed Computing Systems Using a Minimax Criterion, 1985, IEEE Trans. Computers.

[19] P. Ducksbury. Parallel array processing, 1986.

[20] Jake K. Aggarwal, et al. A Mapping Strategy for Parallel Processing, 1987, IEEE Transactions on Computers.

[21] Shahid H. Bokhari, et al. A Partitioning Strategy for Nonuniform Problems on Multiprocessors, 1987, IEEE Transactions on Computers.

[22] P. Sadayappan, et al. Task allocation onto a hypercube by recursive mincut bipartitioning, 1988, C3P.

[23] Charbel Farhat, et al. A simple and efficient automatic FEM domain decomposer, 1988.

[24] Virginia Mary Lo, et al. Heuristic Algorithms for Task Assignment in Distributed Systems, 1988, IEEE Trans. Computers.

[25] Joel M. Crichlow. An introduction to distributed and parallel computing, 1988.

[26] Cecilia R. Aragon, et al. Optimization by Simulated Annealing: An Experimental Evaluation; Part I, Graph Partitioning, 1989, Oper. Res.

[27] Robert G. Webster, et al. The application of finite volume methods for modelling three-dimensional incompressible flow on an unstructured mesh, 1989.

[28] Fred W. Glover, et al. Tabu Search - Part I, 1989, INFORMS J. Comput.

[29] S. Lennart Johnsson, et al. Optimum Broadcasting and Personalized Communication in Hypercubes, 1989, IEEE Trans. Computers.

[30] J. Ramanujam, et al. Cluster partitioning approaches to mapping parallel programs onto a hypercube, 1987, Parallel Comput.

[31] Fred Glover, et al. Tabu Search - Part II, 1989, INFORMS J. Comput.

[32] Roy D. Williams, et al. Performance of dynamic load balancing algorithms for unstructured mesh calculations, 1991, Concurr. Pract. Exp.

[33] M. Cross, et al. Mapping structured grid three-dimensional CFD codes onto parallel architectures, 1991.

[34] Y. M. Chee, et al. Graph partitioning using tabu search, 1991, IEEE International Symposium on Circuits and Systems.

[35] Horst D. Simon, et al. Partitioning of unstructured problems for parallel processing, 1991.

[36] M. N. Shanmukha Swamy, et al. Simulated Annealing and Tabu Search Algorithms for Multiway Graph Partition, 1992, J. Circuits Syst. Comput.

[37] Peter M.-Y. Chow. Control volume unstructured mesh procedure for convection-diffusion solidification processes, 1993.

[38] Horst D. Simon, et al. Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, 1994, Concurr. Pract. Exp.

[39] Martin G. Everett, et al. A Parallelisable Algorithm for Partitioning Unstructured Meshes, 1995.

[40] Martin G. Everett, et al. Parallel unstructured mesh CFD codes: A role for recursive clustering techniques in mesh decomposition, 1995.

[41] S. Lin, et al. An Efficient Heuristic Procedure for Partitioning Graphs, 1970, Bell System Technical Journal.