Flux: Overcoming Scheduling Challenges for Exascale Workflows

Many emerging scientific workflows that target high-end HPC systems require complex interplay with the resource and job management software (RJMS). However, portable, efficient, and easy-to-use scheduling and execution of these workflows remains an unsolved problem. We present Flux, a novel, hierarchical RJMS infrastructure that addresses the key scheduling challenges of modern workflows in a scalable, easy-to-use, and portable manner. At the heart of Flux lies its ability to nest seamlessly within batch allocations created by other schedulers as well as by Flux itself. Once a hierarchy of Flux instances is created within each allocation, its consistent and rich set of well-defined APIs portably and efficiently supports workflows that often feature non-traditional execution patterns, such as complex co-scheduling requirements, massive ensembles of small jobs, and coordination among jobs within an ensemble.
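
The ensemble and API behavior described above can be sketched briefly. The snippet below is a minimal sketch, assuming it runs inside an already-running Flux instance (for example, one launched with "flux start" inside a batch allocation from another scheduler), and uses the Python bindings shipped with the open-source flux-core project, which postdate this paper's description; the handle variable, ensemble size, and hostname command are illustrative assumptions, not details from the paper.

    # Minimal sketch: submit and coordinate a small ensemble of jobs
    # through the Flux Python API. Assumes flux-core is installed and
    # the script runs inside a Flux instance.
    import flux
    from flux.job import JobspecV1, submit, wait

    handle = flux.Flux()  # connect to the enclosing Flux instance

    # Submit an ensemble of single-task jobs (size is illustrative).
    jobids = []
    for _ in range(16):
        spec = JobspecV1.from_command(
            command=["hostname"], num_tasks=1, cores_per_task=1
        )
        jobids.append(submit(handle, spec, waitable=True))

    # Coordinate the ensemble: block until every member completes.
    for jobid in jobids:
        result = wait(handle, jobid)
        print(f"job {jobid} finished, success={result.success}")

Because a Flux instance can itself run as a job inside another Flux instance, or inside an allocation from a different scheduler, a script like this works unchanged at any level of the hierarchy.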
