AdFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability

Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, in which the number of processors used can be changed during execution, can exploit this malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework that maximizes the performance of long-running malleable applications in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live migration, and rescheduling, and makes runtime decisions that dynamically select among these actions at different points of application execution. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.
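
The idea of selecting a fault tolerance action from cost models can be illustrated with a minimal sketch. This is not AdFT's actual model: the `Action` fields, the cost formula (overhead plus failure probability times rework), and all numeric values are hypothetical, chosen only to show how a runtime decision might switch between actions as the predicted failure probability changes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Action:
    """Hypothetical description of one fault tolerance action."""
    name: str
    overhead_s: float  # assumed time spent performing the action
    rework_s: float    # assumed work redone if a failure still occurs


def expected_cost(action: Action, p_fail: float) -> float:
    """Illustrative cost: overhead plus expected rework (not the paper's model)."""
    return action.overhead_s + p_fail * action.rework_s


def select_action(actions: Iterable[Action], p_fail: float) -> Action:
    """Pick the action with the lowest expected cost for this decision point."""
    return min(actions, key=lambda a: expected_cost(a, p_fail))


# Hypothetical candidates; the numbers are placeholders, not measurements.
CANDIDATES = [
    Action("continue", overhead_s=0.0, rework_s=1800.0),
    Action("checkpoint", overhead_s=120.0, rework_s=60.0),
    Action("live-migration", overhead_s=30.0, rework_s=600.0),
]

if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        print(p, select_action(CANDIDATES, p).name)
```

Under these assumed values, doing nothing is cheapest when failures are unlikely, while the more expensive checkpoint pays off only when the predicted failure probability is high.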