The Evaluation of an Effective Out-of-Core Run-Time System in the Context of Parallel Mesh Generation

We present an out-of-core run-time system that supports effective parallel computation of large irregular and adaptive problems, in particular parallel unstructured mesh generation (PUMG). PUMG is a highly challenging application because of its intensive memory accesses, unpredictable communication patterns, and variable and irregular data dependencies, which reflect the unstructured spatial connectivity of mesh elements. Our run-time system allows the footprint of a parallel application to be transformed from wide and shallow into narrow and deep by extending memory utilization to the out-of-core level. It simplifies and streamlines the development of out-of-core applications, which is otherwise highly time-consuming, as well as the conversion of existing applications. It uses the disk, network, and memory hierarchy to achieve high utilization of computing resources without sacrificing performance for PUMG. The run-time system combines different programming paradigms: multi-threading within the nodes using an industrial-strength software framework, one-sided active messages among the nodes, and an out-of-core subsystem for managing large datasets. We evaluated the system on traditional parallel platforms, stress testing all of its layers with three different PUMG methods whose communication and synchronization patterns vary significantly. We demonstrate a high overlap of computation, communication, and disk I/O, which results in good performance when computing large out-of-core problems. The run-time system adds very small overhead (up to 18% on most configurations) when computing in-core, which means that performance is not compromised.
