When is the Right Time to Start the Fault Tolerance Protection?

In High Performance Computing, Fault Tolerance (FT) has become a primary concern due to the constant growth and continuous aging of hardware components, which raise the probability of failure. Failures degrade the performance of the environment and significantly lengthen the execution time users expect. Rollback-recovery protocols are a fundamental component for protecting and restoring users' parallel application executions, although this protection comes with an overhead. This paper proposes a First Protection Point model, which determines the point at which introducing FT protection starts to reduce total execution time when failures are taken into account. Rollback-recovery protocols applied to parallel applications are characterized to obtain the key factors for the model design. The model can help users determine which checkpoints can be removed from the application execution when they are used for FT protection, reducing overhead while maintaining high availability. An analytic evaluation of the model shows the inflection point at which FT protection starts to benefit users. Finally, three experimental environments are set up: two private clusters and a public cluster configured on the well-known Amazon EC2 cloud. A coordinated checkpoint facility is applied to the NAS benchmark applications CG, BT and LU to evaluate the proposed model, reducing the overhead of the provided fault tolerance.
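The intuition behind a first protection point can be illustrated with a toy break-even calculation. The sketch below is not the model proposed in the paper; it assumes, purely for illustration, that a checkpoint taken at elapsed time `t` costs a fixed `ckpt_cost`, that a later failure occurs with a fixed probability `p_fail`, and that recovering from the checkpoint saves `t` of re-execution minus a fixed `restart_cost`. The function name and all parameters are hypothetical.

```python
def first_worthwhile_checkpoint(candidates, ckpt_cost, restart_cost, p_fail):
    """Return the earliest candidate time at which a checkpoint pays off.

    Toy assumption (not the paper's model): a checkpoint at time t saves
    (t - restart_cost) of re-execution if a failure occurs later, which
    happens with fixed probability p_fail. It is worthwhile only when that
    expected saving exceeds the cost of writing the checkpoint.
    """
    for t in sorted(candidates):
        expected_saving = p_fail * (t - restart_cost)
        if expected_saving > ckpt_cost:
            return t
    return None  # no candidate checkpoint is worth its overhead


if __name__ == "__main__":
    # Hypothetical values in seconds; earlier checkpoints are skipped
    # because re-running from the start is cheaper than protecting.
    candidate_times = [60, 120, 180, 240, 300]
    first = first_worthwhile_checkpoint(
        candidate_times, ckpt_cost=20, restart_cost=30, p_fail=0.2
    )
    print("First checkpoint worth taking at t =", first)
```

Under these assumptions, checkpoints before the returned time cost more than they are expected to save, which is the sense in which early checkpoints could be dropped. A realistic model would also have to account for failure distributions and the interaction between successive checkpoints.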
