IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs

Modern HPC systems collect large amounts of I/O performance data. The massive volume and heterogeneity of this data, however, make timely, in-depth, integrated analysis difficult. To overcome this difficulty and to help users identify the root causes of poor application I/O performance, we present IOMiner, an I/O log analytics framework. IOMiner provides an easy-to-use interface for analyzing instrumentation data, a unified storage schema that hides the heterogeneity of the raw instrumentation data, and a sweep-line-based algorithm for root-cause analysis of poor application I/O performance. IOMiner is implemented atop Spark to facilitate efficient, interactive, parallel analysis. We demonstrate the capabilities of IOMiner by using it to analyze logs collected on a large-scale production HPC system. Our analysis techniques not only uncover the root causes of poor I/O performance in key application case studies but also provide new insight into HPC I/O workload characterization.
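
To make the sweep-line idea concrete, below is a minimal, self-contained sketch (in Python, not IOMiner's actual implementation) of how a single sweep over per-rank I/O intervals can expose straggler ranks: it walks the sorted interval endpoints once, tracks how many ranks are doing I/O at each moment, and remembers which ranks remain active once concurrency falls into the tail. The (rank, start, end) record layout and the tail_fraction threshold are assumptions made for illustration.

```python
from collections import defaultdict

def sweep_io_intervals(intervals, tail_fraction=0.1):
    """Sweep over (rank, start, end) I/O intervals.

    Returns (concurrency, tail_ranks), where concurrency is a list of
    (time, number of ranks doing I/O) change points and tail_ranks is the
    set of ranks still active during the final low-concurrency stretch
    (a simple straggler heuristic).
    """
    # Turn each interval into an "open" and a "close" event and sort by time.
    events = []
    for rank, start, end in intervals:
        events.append((start, 1, rank))
        events.append((end, -1, rank))
    events.sort()

    total_ranks = len({rank for rank, _, _ in intervals})
    cutoff = max(1, int(total_ranks * tail_fraction))

    open_counts = defaultdict(int)   # open intervals per rank (handles multiple I/O calls)
    active = set()                   # ranks currently doing I/O
    concurrency = []
    tail_ranks = set()

    for t, delta, rank in events:
        open_counts[rank] += delta
        if open_counts[rank] > 0:
            active.add(rank)
        else:
            active.discard(rank)
        concurrency.append((t, len(active)))

        if len(active) > cutoff:
            tail_ranks = set()            # still in the busy phase; reset candidates
        elif active:
            tail_ranks |= set(active)     # low-concurrency stretch: remember who is left

    return concurrency, tail_ranks

# Example: rank 3 keeps writing long after ranks 0-2 have finished.
intervals = [(0, 0.0, 2.0), (1, 0.1, 2.1), (2, 0.0, 1.9), (3, 0.2, 9.5)]
_, stragglers = sweep_io_intervals(intervals)
print(stragglers)   # {3}
```

Since IOMiner is implemented atop Spark, one would expect such a sweep to operate on partitioned trace records in parallel; the sketch above only illustrates the single-node core of the idea.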
