Flux: Overcoming Scheduling Challenges for Exascale Workflows

Many emerging scientific workflows that target high-end HPC systems require complex interplay with the resource and job management software (RJMS). However, portable, efficient, and easy-to-use scheduling and execution of these workflows remains an unsolved problem. We present Flux, a novel, hierarchical RJMS infrastructure that addresses the key scheduling challenges of modern workflows in a scalable, easy-to-use, and portable manner. At the heart of Flux lies its ability to nest seamlessly within batch allocations created by other schedulers as well as by Flux itself. Once a hierarchy of Flux instances is created within each allocation, its consistent and rich set of well-defined APIs portably and efficiently supports workflows that often feature non-traditional execution patterns, such as complex co-scheduling requirements, massive ensembles of small jobs, and coordination among jobs within an ensemble.
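
The ensemble and API behavior described above can be sketched briefly. The snippet below is a minimal sketch, assuming it runs inside an already-running Flux instance (for example, one launched with "flux start" inside a batch allocation from another scheduler), and uses the Python bindings shipped with the open-source flux-core project, which postdate this paper's description; the handle variable, ensemble size, and hostname command are illustrative assumptions, not details from the paper.

    # Minimal sketch: submit and coordinate a small ensemble of jobs
    # through the Flux Python API. Assumes flux-core is installed and
    # the script runs inside a Flux instance.
    import flux
    from flux.job import JobspecV1, submit, wait

    handle = flux.Flux()  # connect to the enclosing Flux instance

    # Submit an ensemble of single-task jobs (size is illustrative).
    jobids = []
    for _ in range(16):
        spec = JobspecV1.from_command(
            command=["hostname"], num_tasks=1, cores_per_task=1
        )
        jobids.append(submit(handle, spec, waitable=True))

    # Coordinate the ensemble: block until every member completes.
    for jobid in jobids:
        result = wait(handle, jobid)
        print(f"job {jobid} finished, success={result.success}")

Because a Flux instance can itself run as a job inside another Flux instance, or inside an allocation from a different scheduler, a script like this works unchanged at any level of the hierarchy.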
