Improving the availability of supercomputer job input data using temporal replication

Abstract: Storage systems in supercomputers are a major reason for service interruptions. RAID solutions alone cannot provide sufficient protection because 1) growing average disk recovery times make RAID groups increasingly vulnerable to disk failures during reconstruction, and 2) RAID does not help with higher-level faults such as failed I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate “active” job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with real-cluster experiments. Our results show that the scheme allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.
