Effective large scale computing software for parallel mesh generation

Scientists commonly turn to supercomputers or clusters of workstations with hundreds (or even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software decomposes the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The size of the final mesh is limited by the aggregate memory of the parallel machine. In addition, requesting many compute nodes on a shared computing resource may result in a long wait in the job queue, one that can far exceed the time it takes to solve the problem. These two problems (insufficient memory when computing on a small number of nodes, and long waiting times when requesting many nodes from a shared computing resource) can be addressed with out-of-core algorithms. Such algorithms keep most of the dataset out-of-core (outside of memory, on disk) and load only a portion in-core (into memory) at a time. We explored two approaches to out-of-core computing. First, we presented the traditional approach, which is to modify existing in-core algorithms to enable out-of-core computing. While we achieved good performance with this approach, the task is complex and labor intensive. As an alternative, we presented a runtime system designed to support out-of-core applications. It requires little modification of the existing in-core application code and still produces acceptable results. Evaluation of the runtime system showed little performance degradation, while the system simplified and shortened the development cycle of out-of-core applications. The overhead of using the runtime system for small problem sizes is between 12% and 41%, while the overlap of computation, communication, and disk I/O is above 50% and as high as 61% for large problems. The main contribution of our work is the ability to utilize computing resources more effectively. The user can choose either to solve larger problems that would otherwise not fit in memory, or to solve problems of the same size on fewer computing nodes, thus reducing the waiting time on shared clusters and supercomputers. We demonstrated that the latter could potentially lead to substantially shorter wall-clock time.
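To make the out-of-core idea concrete, the following is a minimal sketch in Python of a store that keeps only a fixed number of subdomains in memory and spills the least recently used ones to disk. It is an illustration only and not the runtime system described above: the class `OutOfCoreStore`, the functions `refine_all` and `run_demo`, the use of `pickle` files, and the toy "refinement" step (appending a tuple to a list) are all assumptions introduced for this example.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class OutOfCoreStore:
    """Hypothetical store: at most `capacity` objects in memory, the rest on disk."""

    def __init__(self, capacity, directory):
        self.capacity = capacity
        self.directory = directory
        self.in_core = OrderedDict()  # key -> object, least recently used first
        self.loads = 0                # number of reads from disk

    def _path(self, key):
        return os.path.join(self.directory, f"{key}.pkl")

    def put(self, key, obj):
        self.in_core[key] = obj
        self.in_core.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key not in self.in_core:
            # Object is out-of-core: bring it back into memory.
            with open(self._path(key), "rb") as f:
                self.in_core[key] = pickle.load(f)
            self.loads += 1
        self.in_core.move_to_end(key)
        self._evict_if_needed()
        return self.in_core[key]

    def _evict_if_needed(self):
        # Write the least recently used objects to disk until we fit in memory.
        while len(self.in_core) > self.capacity:
            old_key, old_obj = self.in_core.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_obj, f)


def refine_all(store, keys, passes):
    """Toy stand-in for meshing: add one 'element' to each subdomain per pass."""
    for p in range(passes):
        for key in keys:
            subdomain = store.get(key)
            subdomain.append((key, p))


def run_demo():
    with tempfile.TemporaryDirectory() as directory:
        store = OutOfCoreStore(capacity=2, directory=directory)
        keys = list(range(5))
        for key in keys:
            store.put(key, [])
        refine_all(store, keys, passes=3)
        sizes = [len(store.get(key)) for key in keys]
        return sizes, len(store.in_core), store.loads
```

Calling `run_demo()` processes five subdomains while holding at most two in memory at any time. A real out-of-core mesh generator would additionally overlap disk I/O with computation and communication, which this sequential sketch does not attempt.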
