Towards a High Performance Implementation of MPI-IO on the Lustre File System

Lustre is becoming an increasingly important file system for large-scale computing clusters. Many data-intensive applications use MPI-IO for their I/O requirements, yet it has been well documented that MPI-IO performs poorly in a Lustre file system environment, and the reasons for this poor performance are not well understood. We believe the primary cause is that the assumptions underpinning most of the parallel I/O optimizations implemented in MPI-IO do not hold in a Lustre environment. Perhaps the most important such assumption is that optimal performance is obtained by performing large, contiguous I/O operations. Our research suggests that this is often the worst approach to take on a Lustre file system; in fact, we found that the best performance is sometimes achieved when each process performs a series of smaller, non-contiguous I/O requests. In this paper, we provide experimental results showing that these assumptions do not apply to Lustre, and explore new approaches that appear to provide significantly better performance.
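
As a rough illustration of the access pattern described above, the sketch below shows how an MPI-IO program can issue a series of smaller, stripe-aligned, non-contiguous writes instead of one large contiguous write per process. The stripe size (1 MB), block count, and file name are illustrative assumptions, not values taken from the paper; the sketch simply uses the standard MPI-IO file-view mechanism (MPI_Type_vector, MPI_File_set_view, MPI_File_write_all).

```c
/* Sketch only: each process writes a series of stripe-sized blocks that are
 * non-contiguous in the file but aligned to (assumed) Lustre stripe
 * boundaries.  Stripe size, block count, and file name are assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File     fh;
    MPI_Datatype filetype;
    int          rank, nprocs;
    const int    stripe  = 1 << 20;   /* assumed 1 MB Lustre stripe size   */
    const int    nblocks = 16;        /* stripes written by each process   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc((size_t)nblocks * stripe);
    memset(buf, rank, (size_t)nblocks * stripe);

    /* File view: each process owns every nprocs-th stripe, so its accesses
     * are a series of smaller, non-contiguous, stripe-aligned writes.      */
    MPI_Type_vector(nblocks, stripe, nprocs * stripe, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * stripe, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    /* Collective write of nblocks stripe-sized pieces per process. */
    MPI_File_write_all(fh, buf, nblocks * stripe, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Whether this pattern outperforms a single large contiguous write per process depends on the file's stripe count and the number of object storage targets involved, which is exactly the behavior the paper's experiments examine.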
