Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Fault tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging: current systems are smaller than the next-generation systems we expect to see, and their architectural features (e.g., interconnect, persistent storage) differ significantly. To overcome these challenges, we propose the use of simulation, which has been shown to be an effective tool for investigating the performance characteristics of applications on future systems. In this work, we: (1) identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; (2) demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and (3) evaluate the predictive performance of our modified simulator. We also describe how we optimized the simulator for large temporal and spatial scales, allowing it to run 4x faster and use over 100x less memory.
