Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart

Speedup models are powerful analytical tools for evaluating and predicting the performance of parallel applications. Unfortunately, the well-known speedup models like Amdahl's law and Gustafson's law do not take reliability into consideration and therefore cannot accurately account for application performance in the presence of failures. In this study, we enhance Amdahl's law and Gustafson's law by considering the impact of failures and the effect of coordinated checkpointing/restart. Unlike existing analytical studies relying on Exponential failure distribution alone, in this work we consider both Exponential and Weibull failure distributions in the construction of our reliability-aware speedup models. The derived reliability-aware models are validated through trace-based simulations under a variety of parameter settings. Our trace-based simulations demonstrate these models can effectively quantify failure impact on application speedup. Moreover, we present two case studies to illustrate the use of these reliability-aware speedup models.

[1]  Daniel P. Siewiorek,et al.  Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[2]  Marco Aurélio Amaral Henriques,et al.  Speedup and scalability analysis of Master-Slave applications on large heterogeneous clusters , 2007, J. Parallel Distributed Comput..

[3]  James H. Laros,et al.  Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[4]  Zhiling Lan,et al.  Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.

[5]  Franck Cappello,et al.  SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[6]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[7]  James S. Plank,et al.  Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[8]  Zhiling Lan,et al.  Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.

[9]  Ron A. Oldfield Lightweight storage and overlay networks for fault tolerance. , 2006 .

[10]  Kishor S. Trivedi,et al.  Minimizing completion time of a program by checkpointing and rejuvenation , 1996, SIGMETRICS '96.

[11]  Horst Rinne,et al.  The Weibull Distribution: A Handbook , 2008 .

[12]  Christopher D. Carothers,et al.  An analysis of clustered failures on large supercomputing systems , 2009, J. Parallel Distributed Comput..

[13]  Mark S. Squillante,et al.  Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.

[14]  James S. Plank,et al.  Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..

[15]  Rong Ge,et al.  Power-Aware Speedup , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[16]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[17]  Seetharami R. Seelam,et al.  Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).

[18]  John L. Gustafson,et al.  Reevaluating Amdahl's law , 1988, CACM.

[19]  Sarala Arunagiri,et al.  Opportunistic Checkpoint Intervals to Improve System Performance , 2008 .

[20]  Zhiling Lan,et al.  Reliability-aware scalability models for high performance computing , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[21]  Ravishankar K. Iyer,et al.  Modeling coordinated checkpointing for large-scale supercomputers , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[22]  Mukesh Singhal,et al.  Checkpointing with mutable checkpoints , 2003, Theor. Comput. Sci..

[23]  Dilma Da Silva,et al.  Alleviating scalability issues of checkpointing protocols , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[24]  Vipin Kumar,et al.  Analysis of scalability of parallel algorithms and architectures: a survey , 1991, ICS '91.

[25]  C. Murray Woodside,et al.  Evaluating the scalability of distributed systems , 1998, Proceedings of the Thirty-First Hawaii International Conference on System Sciences.

[26]  Peter H. Beckman,et al.  Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System , 2009, 2009 International Conference on Parallel Processing Workshops.

[27]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..

[28]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[29]  Zhiling Lan,et al.  A fast restart mechanism for checkpoint/recovery protocols in networked environments , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[30]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[31]  Zhiling Lan,et al.  Filtering log data: Finding the needles in the Haystack , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[32]  Sarah Ellen Michalak,et al.  Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[33]  Christian Engelmann,et al.  Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.

[34]  Ming Wu,et al.  Performance under failures of high-end computing , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[35]  Meeta Sharma Gupta,et al.  Performance implications of periodic checkpointing on large-scale cluster systems , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[36]  Thomas Hérault,et al.  Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[37]  Xian-He Sun,et al.  Optimizing HPC Fault-Tolerant Environment: An Analytical Approach , 2010, 2010 39th International Conference on Parallel Processing.

[38]  Franck Cappello,et al.  Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[39]  W YoungJohn A first order approximation to the optimum checkpoint interval , 1974 .

[40]  Fabrizio Petrini,et al.  System-level fault-tolerance in large-scale parallel machines with buffered coscheduling , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[41]  Chao Wang,et al.  Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[42]  Charng-Da Lu,et al.  Big Systems and Big Reliability Challenges , 2003, PARCO.

[43]  Stephen L. Scott,et al.  Reliability of a System of k Nodes for High Performance Computing Applications , 2010, IEEE Transactions on Reliability.

[44]  Robert Latham,et al.  I/O performance challenges at leadership scale , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[45]  Hsien-Hsin S. Lee,et al.  Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era , 2008, Computer.

[46]  John T. Daly,et al.  Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters , 2010, HPDC '10.

[47]  E. N. Elnozahy,et al.  Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.

[48]  Stephen L. Scott,et al.  An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[49]  Vivek Sarkar,et al.  Software challenges in extreme scale systems , 2009 .

[50]  Narayan Desai,et al.  Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[51]  Lionel M. Ni,et al.  Another view on parallel speedup , 1990, Proceedings SUPERCOMPUTING '90.

[52]  Henri Casanova,et al.  Checkpointing strategies for parallel jobs , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[53]  Larry Rudolph,et al.  Cooperative checkpointing: a robust approach to large-scale systems reliability , 2006, ICS '06.

[54]  Victor F. Nicola,et al.  Checkpointing and the modeling of program execution time , 1994 .

[55]  G. Amdhal,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).