Understanding I/O Behavior in Scientific and Data-Intensive Computing (Dagstuhl Seminar 21332)

Two key changes are driving an immediate need for a deeper understanding of I/O workloads in high-performance computing (HPC): applications are evolving beyond traditional bulk-synchronous models to include integrated multistep workflows, in situ analysis, artificial intelligence, and data analytics methods; and storage system designs are evolving beyond the two-tiered file system and archive model toward complex hierarchies containing temporary, fast storage tiers close to compute resources, with markedly different performance properties. Both changes represent a significant departure from the decades-long status quo and require investigation by storage researchers and practitioners to understand their impact on overall I/O performance. Without an in-depth understanding of I/O workload behavior, storage system designers, I/O middleware developers, facility operators, and application developers will not know how best to design or utilize the additional tiers for optimal performance of a given I/O workload. The goal of this Dagstuhl Seminar was to bring together experts in I/O performance analysis and storage system architecture to collectively evaluate how our community captures and analyzes I/O workloads on HPC systems, identify gaps in our methodologies, and determine how to develop a deeper understanding of their impact on HPC systems. Our discussions were lively and identified critical research needs in the area of understanding I/O behavior. We document those discussions in this report.

Seminar: August 15–20, 2021 – https://www.dagstuhl.de/21332

2012 ACM Subject Classification: General and reference → General literature; Hardware → 3D integrated circuits; Software and its engineering → Software design engineering; Networks → Network performance analysis
