Iteration Based Collective I/O Strategy for Parallel I/O Systems

MPI collective I/O is a widely used I/O method that helps data-intensive scientific applications achieve better I/O performance. However, existing collective I/O strategies have been observed to perform poorly because of access contention. Existing collective I/O optimizations focus mainly on the efficiency of the I/O phase and ignore the shuffle cost, which limits the performance improvement they can deliver. We observe that, as I/O sizes grow, a single I/O operation issued by the application is split into several iterations before it completes. Consequently, I/O requests in each file domain are not necessarily issued to the parallel file system simultaneously unless they are carried out within the same iteration step. Based on this observation, this paper proposes a new collective I/O strategy that reorganizes I/O requests within each file domain rather than coordinating requests across file domains, eliminating access contention without introducing extra shuffle cost between aggregators and computing processes. Using the IOR benchmark workloads, we evaluate the new strategy and compare it with the conventional one. The proposed strategy improves I/O bandwidth by up to 47%-63% compared to the existing ROMIO collective I/O strategy.
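
To make the iteration mechanism concrete, the sketch below shows a conventional ROMIO-style collective write in C. The file name, block size, and the 4 MiB cb_buffer_size hint are illustrative assumptions, not values from the paper. Because each aggregator's collective buffer is much smaller than its file domain, the write is carried out over several two-phase iterations, so only the requests issued within the same iteration step reach the file system together; this is the baseline path that the proposed strategy reorganizes within each file domain.

/* Minimal sketch of a two-phase collective write (assumed setup). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process contributes one contiguous 64 MiB block (illustrative size). */
    const MPI_Offset block = 64 * 1024 * 1024;
    char *buf = malloc(block);
    memset(buf, rank, block);

    /* ROMIO hint: a 4 MiB collective buffer per aggregator forces this
       write to be completed over several two-phase iterations. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Collective write at a rank-specific offset: ROMIO partitions the
       aggregate access range into file domains, shuffles data to the
       aggregators, and the aggregators issue file-system requests
       iteration by iteration. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}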
