Toward Understanding I/O Behavior in HPC Workflows

Scientific discovery increasingly depends on complex workflows consisting of multiple phases and sometimes millions of parallelizable tasks or pipelines. These workflows access storage resources for a variety of purposes, including preprocessing, simulation output, and postprocessing steps. Unfortunately, most workflow models focus on the scheduling and allocation of computational resources for tasks, while the impact on storage systems remains a secondary objective and an open research question. I/O performance is usually not accounted for in the workflow telemetry reported to users. In this paper, we present an approach to augment the I/O efficiency of the individual tasks of workflows by combining workflow description frameworks with system I/O telemetry data. We introduce a conceptual architecture and a prototype implementation for HPC data center deployments. We also identify and discuss challenges that workflow management and monitoring systems for HPC will need to address in the future. We demonstrate how real-world applications and workflows could benefit from the approach, and we show how it helps communicate performance-tuning guidance to users.
