Understanding I/O Behavior in Scientific and Data-Intensive Computing (Dagstuhl Seminar 21332)

Two key changes are driving an immediate need for a deeper understanding of I/O workloads in high-performance computing (HPC): applications are evolving beyond traditional bulk-synchronous models to include integrated multistep workflows, in situ analysis, artificial intelligence, and data analytics methods; and storage system designs are evolving beyond the two-tiered file system and archive model toward complex hierarchies containing temporary, fast storage tiers close to compute resources, with markedly different performance properties. Both changes represent a significant departure from the decades-long status quo and require investigation by storage researchers and practitioners to understand their impact on overall I/O performance. Without an in-depth understanding of I/O workload behavior, storage system designers, I/O middleware developers, facility operators, and application developers will not know how best to design or utilize the additional tiers for optimal performance of a given I/O workload. The goal of this Dagstuhl Seminar was to bring together experts in I/O performance analysis and storage system architecture to collectively evaluate how our community captures and analyzes I/O workloads on HPC systems, identify gaps in our methodologies, and determine how to develop a deeper understanding of their impact on HPC systems. Our discussions were lively and identified critical research needs in the area of understanding I/O behavior. We document those discussions in this report.

Seminar: August 15–20, 2021 – https://www.dagstuhl.de/21332

2012 ACM Subject Classification: General and reference → General literature; Hardware → 3D integrated circuits; Software and its engineering → Software design engineering; Networks → Network performance analysis
