How file access patterns influence interference among cluster applications

On large-scale clusters, tens to hundreds of applications can simultaneously access a parallel file system, leading to contention and, in turn, to degraded application performance. However, the degree of interference depends on the specific file access pattern. On the basis of synchronized time-slice profiles, we compare the interference potential of different file access patterns. We consider both micro-benchmarks, to study the effects of certain patterns in isolation, and realistic applications, to gauge the severity of such interference under production conditions. In particular, we found that writing large files simultaneously with small files can slow down the latter at small chunk sizes but the former at larger chunk sizes. We further show that such effects can seriously affect the runtime of real applications, by up to a factor of five in one instance. In the future, both our insights and profiling techniques can be used to automatically classify the interference potential between applications and to adjust scheduling decisions accordingly.
