PLFS: a checkpoint filesystem for parallel applications

Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, the size of writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log structured file system, PLFS. PLFS remaps an application's preferred data layout into one which is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.

[1]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[2]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[3]  James Lau,et al.  File System Design for an NFS File Server Appliance , 1994, USENIX Winter.

[4]  Jeffrey F. Naughton,et al.  Low-Latency, Concurrent Checkpointing for Parallel Programs , 1994, IEEE Trans. Parallel Distributed Syst..

[5]  Jim Zelenka,et al.  The Scotch parallel storage systems , 1995, Digest of Papers. COMPCON'95. Technologies for the Information Superhighway.

[6]  Miron Livny,et al.  Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .

[7]  Kai Li,et al.  Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..

[8]  Rajeev Thakur,et al.  Data sieving and collective I/O in ROMIO , 1998, Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation.

[9]  Kai Li,et al.  Memory Exclusion: Optimizing the Performance of Checkpointing Systems , 1999, Softw. Pract. Exp..

[10]  Douglas Thain,et al.  Bypass: a tool for building split execution systems , 2000, Proceedings the Ninth International Symposium on High-Performance Distributed Computing.

[11]  Erez Zadok,et al.  FIST: a language for stackable file systems , 2000, OPSR.

[12]  Andrea C. Arpaci-Dusseau,et al.  Implicit coscheduling: coordinated scheduling with implicit information in distributed systems , 2001, TOCS.

[13]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[14]  Jianwei Li,et al.  Parallel netCDF: A High-Performance Scientific I/O Interface , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[15]  Remzi H. Arpaci-Dusseau,et al.  Run-time adaptation in river , 2003, TOCS.

[16]  E. N. Elnozahy,et al.  Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.

[17]  Brent Welch,et al.  Managing Scalability in Object Storage Systems for HPC Linux Clusters , 2004, MSST.

[18]  Tyce T. McLarty,et al.  Parallel file system testing for the lunatic fringe: the care and feeding of restless I/O power users , 2005, 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST'05).

[19]  Jason Duell,et al.  Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .

[20]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[21]  Samuel Lang,et al.  GIGA+: scalable directories for shared file systems , 2007, PDSW '07.

[22]  Eduardo Pinheiro,et al.  Failure Trends in a Large Disk Drive Population , 2007, FAST.

[23]  Bianca Schroeder,et al.  Understanding failures in petascale computers , 2007 .

[24]  Jeffrey S. Vetter,et al.  Exploiting Lustre File Joining for Effective Collective IO , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).

[25]  M. Polte,et al.  Fast log-based concurrent writing of checkpoints , 2008, 2008 3rd Petascale Data Storage Workshop.

[26]  Sudharshan S. Vazhkudai,et al.  Aggregate Memory as an Intermediate Checkpoint Storage Device , 2008 .

[27]  P. Nowoczynski,et al.  Zest Checkpoint storage system for large supercomputers , 2008, 2008 3rd Petascale Data Storage Workshop.

[28]  Karsten Schwan,et al.  Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS) , 2008, CLADE '08.

[29]  Bin Zhou,et al.  Scalable Performance of the Panasas Parallel File System , 2008, FAST.

[30]  Matei Ripeanu,et al.  stdchk: A Checkpoint Storage System for Desktop Grid Computing , 2007, 2008 The 28th International Conference on Distributed Computing Systems.

[31]  Robert B. Ross,et al.  Coordinating government funding of file system and I/O research through the high end computing university research activity , 2009, OPSR.

[32]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..