NORNS: Extending Slurm to Support Data-Driven Workflows through Asynchronous Data Staging

As HPC systems move into the Exascale era, parallel file systems are struggling to keep up with the I/O requirements of data-intensive problems. While the inclusion of burst buffers has helped to alleviate this by improving I/O performance, it has also increased the complexity of the I/O hierarchy by adding storage layers, each with its own semantics. This forces users to explicitly manage data movement between the different storage layers, which, coupled with the lack of interfaces to communicate data dependencies between jobs in a data-driven workflow, prevents resource schedulers from optimizing these transfers to benefit the cluster's overall performance. This paper proposes several extensions to job schedulers, prototyped using the Slurm scheduling system, that enable users to appropriately express the data dependencies between the different phases of their processing workflows. It also introduces NORNS, a new service for asynchronous data staging that coordinates with the job scheduler to orchestrate data transfers and achieve better resource utilization. Our evaluation shows that a workflow-aware Slurm exploits node-local storage more effectively, reducing file system I/O contention and improving job running times.
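To make the submit-and-wait interaction described above concrete, the sketch below shows, in plain C, the kind of interface an asynchronous staging service could expose to a job so that transfers between the parallel file system and node-local storage proceed in the background. This is a minimal sketch under stated assumptions: all identifiers (stage_task_t, stage_submit, stage_wait, the lustre:// and node-local:// prefixes) are illustrative and do not reproduce the actual NORNS API or Slurm extensions presented in the paper.

```c
/*
 * Illustrative sketch of an asynchronous stage-in request.
 * All names are hypothetical; they are NOT the real NORNS API.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *src;   /* source resource, e.g. a parallel file system path */
    const char *dst;   /* destination, e.g. node-local storage on the allocated nodes */
    int         id;    /* handle returned by the staging service */
} stage_task_t;

/* Ask the (hypothetical) staging daemon to copy data in the background.
 * A real implementation would issue an RPC to a node-local daemon, which
 * performs the transfer asynchronously so the job does not block on
 * parallel file system I/O. */
static int stage_submit(stage_task_t *t) {
    printf("staging %s -> %s\n", t->src, t->dst);
    t->id = 42;        /* placeholder task handle */
    return 0;
}

/* Block until the transfer identified by the task handle completes. */
static int stage_wait(const stage_task_t *t) {
    printf("task %d completed\n", t->id);
    return 0;
}

int main(void) {
    stage_task_t input = {
        .src = "lustre:///projects/app/input.nc",
        .dst = "node-local:///tmp/input.nc",
    };

    if (stage_submit(&input) != 0) {
        fprintf(stderr, "stage-in submission failed\n");
        return EXIT_FAILURE;
    }

    /* The scheduler could overlap this transfer with the tail of a
     * preceding job in the workflow; that coordination is the point
     * of exposing data dependencies to it. */

    stage_wait(&input);   /* ensure data is local before computation starts */
    return EXIT_SUCCESS;
}
```

Decoupling submission from completion is what allows a workflow-aware scheduler to overlap staging with other jobs' computation instead of serializing everything on the shared parallel file system, which is the behavior the evaluation attributes the reduced I/O contention to.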
