Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Challenges and Research Opportunities

The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback—recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.

[1]  Edsger W. Dijkstra,et al.  Self-stabilizing systems in spite of distributed control , 1974, CACM.

[2]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[3]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[4]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[5]  Kai Li,et al.  Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..

[6]  Kai Li,et al.  Memory Exclusion: Optimizing the Performance of Checkpointing Systems , 1999, Softw. Pract. Exp..

[7]  Christian Engelmann,et al.  Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors , 2002 .

[8]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[9]  R. Vilalta,et al.  Providing Persistent and Consistent Resources through Event Log Analysis and Predictions for Large-scale Computing Systems , 2002 .

[10]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[11]  Daniel Marques,et al.  C3: A System for Automating Application-Level Checkpointing of MPI Programs , 2003, LCPC.

[12]  G. R. Liu,et al.  1013 Mesh Free Methods : Moving beyond the Finite Element Method , 2003 .

[13]  Charng-da Lu,et al.  Scalable Diskless Checkpointing for Large Parallel Systems , 2005 .

[14]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[15]  Stéphane Genaud,et al.  P2P-MPI: A Peer-to-Peer Framework for Robust Execution of Message Passing Parallel Programs on Grids , 2007, Journal of Grid Computing.

[16]  Laxmikant V. Kalé,et al.  Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.

[17]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[18]  Bianca Schroeder,et al.  Understanding failures in petascale computers , 2007 .

[19]  Zhiling Lan,et al.  Fault-Driven Re-Scheduling For Improving System-level Fault Resilience , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).

[20]  Zizhong Chen Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[21]  F. Mueller,et al.  Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[22]  Fatiha Bouabache,et al.  Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[23]  Chao Wang,et al.  A tunable holistic resiliency approach for high-performance computing systems , 2009, PPoPP '09.

[24]  George Bosilca,et al.  Redesigning the message logging model for high performance , 2010, Concurr. Comput. Pract. Exp..