Design and Evaluation of MPI File Domain Partitioning Methods under Extent-Based File Locking Protocol

MPI collective I/O has been an effective method for parallel shared-file access that maintains the canonical order of structured data in files. Its implementation commonly uses a two-phase I/O strategy that partitions a file into disjoint file domains, assigns each domain to a unique process, redistributes the I/O data according to their locations in the domains, and has each process perform the I/O for its assigned domain. The quality of this partitioning determines the maximal performance achievable by the underlying file system, as shared-file I/O has long been impeded by the cost of the file system's data consistency control, particularly by conflicting locks. This paper proposes several file domain partitioning methods designed to reduce lock conflicts under the extent-based file locking protocol. Experiments with four I/O benchmarks on the IBM GPFS and Lustre parallel file systems show that the partitioning method producing the fewest lock conflicts achieves the highest performance. The benefit of removing conflicting locks can be so significant that write bandwidth differences of more than thirty times are observed between the best and worst methods.
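As a minimal illustration of the idea behind lock-conflict-avoiding partitioning (not code from the paper), the C sketch below divides an aggregate access range among processes so that every file-domain boundary falls on a multiple of the file system's locking unit, e.g., the stripe or block size. With aligned boundaries, no two processes request bytes inside the same lock extent, which is the condition identified above as the source of conflicting locks. The function name partition_aligned, the lock_unit parameter, and the example sizes are hypothetical; the specific partitioning methods evaluated in the paper may differ.

/* Hypothetical sketch: partition the aggregate access range
 * [range_start, range_end) among nprocs I/O aggregators so that
 * every file-domain boundary is aligned to the file system's
 * locking unit. */
#include <stdio.h>

typedef long long offset_t;

/* Compute the file domain [fd_start, fd_end) assigned to `rank`. */
static void partition_aligned(offset_t range_start, offset_t range_end,
                              int nprocs, int rank, offset_t lock_unit,
                              offset_t *fd_start, offset_t *fd_end)
{
    /* Number of whole locking units covered by the access range. */
    offset_t first_unit = range_start / lock_unit;
    offset_t last_unit  = (range_end + lock_unit - 1) / lock_unit;
    offset_t nunits     = last_unit - first_unit;

    /* Distribute whole units as evenly as possible; the first
     * (nunits % nprocs) processes receive one extra unit. */
    offset_t base  = nunits / nprocs;
    offset_t extra = nunits % nprocs;
    offset_t my_units     = base + (rank < extra ? 1 : 0);
    offset_t units_before = rank * base + (rank < extra ? rank : extra);

    *fd_start = (first_unit + units_before) * lock_unit;
    *fd_end   = *fd_start + my_units * lock_unit;

    /* Clip the first and last domains back to the actual range. */
    if (*fd_start < range_start) *fd_start = range_start;
    if (*fd_end   > range_end)   *fd_end   = range_end;
}

int main(void)
{
    /* Example: a 10 MB range, 4 processes, 1 MB locking unit. */
    offset_t s, e;
    for (int rank = 0; rank < 4; rank++) {
        partition_aligned(0, 10 << 20, 4, rank, 1 << 20, &s, &e);
        printf("rank %d: [%lld, %lld)\n", rank, s, e);
    }
    return 0;
}

In this example the four domains are [0, 3 MB), [3 MB, 6 MB), [6 MB, 8 MB), and [8 MB, 10 MB): each lock extent is touched by exactly one process, so no conflicting lock requests are issued during the I/O phase.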
