Using Simulation to Evaluate the Performance of Resilience Strategies at Scale
Torsten Hoefler | Patrick M. Widener | Kurt B. Ferreira | Scott Levy | Bryan Topp | Dorian C. Arnold
[1] Jack J. Dongarra,et al. Algorithm-based diskless checkpointing for fault tolerant matrix operations , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[2] James H. Laros,et al. Redundant computing for exascale systems , 2010.
[3] Yuan Xie,et al. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[4] Rolf Riesen,et al. libhashckpt: Hash-Based Incremental Checkpointing Using GPU's , 2011, EuroMPI.
[5] Franck Cappello,et al. HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[6] Ali Pinar,et al. A Simulator for Large-Scale Parallel Computer Architectures , 2010, Int. J. Distributed Syst. Technol..
[7] Robert B. Ross,et al. Modeling a Million-Node Dragonfly Network Using Massively Parallel Discrete-Event Simulation , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[8] Seetharami R. Seelam,et al. Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[9] Ciprian Dobre,et al. Simulator for fault tolerance in large scale distributed systems , 2010, Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing.
[10] Jack Dongarra,et al. Recent Advances in the Message Passing Interface - 17th European MPI Users' Group Meeting, EuroMPI 2010, Stuttgart, Germany, September 12-15, 2010. Proceedings , 2010, EuroMPI.
[11] Patrick M. Widener,et al. Asking the Right Questions: Benchmarking Fault-Tolerant Extreme-Scale Systems , 2013, Euro-Par Workshops.
[12] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[14] Ramesh Subramonian,et al. LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.
[15] Jean-Marc Vincent,et al. A Flexible Checkpoint/Restart Model in Distributed Systems , 2009, PPAM.
[16] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002.
[17] Torsten Hoefler,et al. Group Operation Assembly Language - A Flexible Way to Express Collective Communication , 2009, 2009 International Conference on Parallel Processing.
[18] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[19] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[20] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[21] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[22] Franck Cappello,et al. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[23] Laxmikant V. Kalé,et al. Simulation-Based Performance Prediction for Large Parallel Machines , 2005, International Journal of Parallel Programming.
[24] Ron Brightwell,et al. On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance , 2012, 2012 41st International Conference on Parallel Processing.
[25] Stephen L. Scott,et al. Evaluation of fault-tolerant policies using simulation , 2007, 2007 IEEE International Conference on Cluster Computing.
[26] Torsten Hoefler,et al. Characterizing the Influence of System Noise on Large-Scale Applications by Simulation , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[27] Lorenzo Alvisi,et al. An analysis of communication induced checkpointing , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[28] Dejan S. Milojicic,et al. Optimizing Checkpoints Using NVM as Virtual Memory , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[29] Luís Moura Silva,et al. An experimental study about diskless checkpointing , 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).
[30] Daniel Marques,et al. Compiler-enhanced incremental checkpointing for OpenMP applications , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[31] A. Lumsdaine,et al. LogGOPSim: simulating large-scale applications in the LogGOPS model , 2010, HPDC '10.
[32] Horst D. Simon. Barriers to Exascale Computing , 2012, VECPAR.
[33] Christian Engelmann,et al. xSim: The extreme-scale simulator , 2011, 2011 International Conference on High Performance Computing & Simulation.
[34] Christine Morin,et al. A hierarchical checkpointing protocol for parallel applications in cluster federations , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[35] Ron Brightwell,et al. Characterizing application sensitivity to OS interference using kernel-level noise injection , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[36] Thomas Hérault,et al. An evaluation of User-Level Failure Mitigation support in MPI , 2012, Computing.
[37] Christine Morin,et al. Hybrid checkpointing for parallel applications in cluster federations , 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004..
[38] Kengo Nakajima,et al. High Performance Computing for Computational Science - VECPAR 2012 , 2013, Lecture Notes in Computer Science.