Hierarchical task mapping of cell-based AMR cosmology simulations

Cosmology simulations are highly communication-intensive, so topology-aware task mapping is critical for performance optimization. To exploit the architectural properties of multiprocessor clusters, namely the performance gap between inter-node and intra-node communication and the gap between inter-socket and intra-socket communication, we design and develop a hierarchical task mapping scheme for cell-based AMR (Adaptive Mesh Refinement) cosmology simulations, in particular the ART application. Our scheme consists of two parts: (1) an inter-node mapping that maps application processes onto nodes with the objective of minimizing network traffic among nodes, and (2) an intra-node mapping within each node that minimizes the maximum size of messages transmitted between CPU sockets. Experiments on production supercomputers with 3D torus and fat-tree topologies show that our scheme can reduce application communication cost by up to 50%. More importantly, our scheme is generic and can be extended to many other applications.
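The two objectives named in the abstract can be made concrete with a small sketch. This is not the paper's mapping algorithm; it only evaluates a given placement against the two stated goals. The data layout (`comm`, `placement`), the function names, and the reading of "maximum size of messages" as the largest single inter-socket message are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical inputs (not from the paper):
#   comm[(p, q)]  -> bytes sent from process p to process q
#   placement[p]  -> (node, socket) that process p is mapped to
Comm = Dict[Tuple[int, int], int]
Placement = Dict[int, Tuple[int, int]]


def inter_node_traffic(comm: Comm, placement: Placement) -> int:
    """Total bytes exchanged between processes placed on different nodes."""
    return sum(
        size
        for (p, q), size in comm.items()
        if placement[p][0] != placement[q][0]
    )


def max_inter_socket_message(comm: Comm, placement: Placement) -> int:
    """Largest single message between two different sockets of one node."""
    sizes = [
        size
        for (p, q), size in comm.items()
        if placement[p][0] == placement[q][0]   # same node
        and placement[p][1] != placement[q][1]  # different socket
    ]
    return max(sizes, default=0)


# Toy communication pattern: four processes, two nodes with two sockets each.
comm = {(0, 1): 100, (1, 2): 40, (2, 3): 100, (0, 3): 10}

# Two candidate placements of the same four processes.
block = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
cyclic = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}

for name, placement in (("block", block), ("cyclic", cyclic)):
    print(
        name,
        inter_node_traffic(comm, placement),
        max_inter_socket_message(comm, placement),
    )
```

In this toy case the block placement keeps the heavy pairs on the same node, so far less data crosses the network, at the price of larger inter-socket messages. A hierarchical scheme would address the first objective when assigning processes to nodes and the second when assigning them to sockets within each node.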
