Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows

High-performance computing (HPC) workloads increasingly rely on loosely coupled large-scale simulations. Unfortunately, most large-scale HPC platforms, including Cray/ALPS environments, are designed to execute long-running jobs using coarse-grained launch capabilities (e.g., one MPI rank per core on all allocated compute nodes). This design limits capability-class workload campaigns that require large numbers of discrete or loosely coupled simulations, and for which time-to-solution becomes an untenable bottleneck. This paper describes the challenges of supporting the fine-grained launch capabilities needed to execute loosely coupled large-scale simulations on Cray/ALPS platforms. More precisely, we present the details of an enhanced runtime system that supports this use case, and we report initial results from early testing on systems at Oak Ridge National Laboratory.

Keywords: Framework, Runtime, Ensemble computing.
