I/O performance of the Santos Dumont supercomputer

In this article, we study the I/O performance of the Santos Dumont supercomputer, since the gap between processing and data access speeds causes many applications to spend a large portion of their execution time on I/O operations. For a large-scale, expensive supercomputer, it is essential to ensure that applications achieve the best I/O performance in order to promote efficient usage. We monitor a week of the machine's activity and present a detailed study of the obtained metrics, aiming to provide an understanding of its workload. From experience with one numerical simulation, we identified large I/O performance differences between the MPI implementations available to users. We investigated the phenomenon and narrowed it down to collective I/O operations with small request sizes. For these, we concluded that the customized MPI implementation provided by the machine's vendor (used by more than 20% of the jobs) presents the worst performance. By investigating the issue, we provide information to help improve future MPI-IO collective write implementations, as well as practical guidelines to help users and steer future system upgrades. Finally, we discuss the challenge of describing an application's I/O behavior without depending on information from users. Such a description allows for identifying the application's I/O bottlenecks and proposing ways of improving its I/O performance. We propose a methodology to do so and use GROMACS, the application with the largest number of jobs in 2017, as a case study.
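To make the problematic access pattern concrete, the following is a minimal sketch of an MPI-IO collective write with small per-rank requests, the kind of operation in which the compared MPI implementations diverge. The file name, block size, and data values are illustrative assumptions, not details taken from the study.

```c
/* Sketch: each rank issues a collective write of a small contiguous block.
 * Compile with an MPI compiler wrapper, e.g. `mpicc small_collective.c`. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Small request size: each rank writes only 1 KiB per collective call
     * (256 four-byte integers); chosen for illustration only. */
    const int count = 256;
    int buf[256];
    for (int i = 0; i < count; i++)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "collective_small.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank-interleaved contiguous blocks. The MPI-IO layer may aggregate
     * these small requests (e.g., via two-phase I/O), and how well it does
     * so is where collective-write implementations can differ. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```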
