Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart
[1] Daniel P. Siewiorek,et al. Error log analysis: statistical modeling and heuristic trend analysis , 1990 .
[2] Marco Aurélio Amaral Henriques,et al. Speedup and scalability analysis of Master-Slave applications on large heterogeneous clusters , 2007, J. Parallel Distributed Comput..
[3] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[4] Zhiling Lan,et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.
[5] Franck Cappello,et al. SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[7] James S. Plank,et al. Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).
[8] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[9] Ron A. Oldfield. Lightweight storage and overlay networks for fault tolerance. , 2006 .
[10] Kishor S. Trivedi,et al. Minimizing completion time of a program by checkpointing and rejuvenation , 1996, SIGMETRICS '96.
[11] Horst Rinne,et al. The Weibull Distribution: A Handbook , 2008 .
[12] Christopher D. Carothers,et al. An analysis of clustered failures on large supercomputing systems , 2009, J. Parallel Distributed Comput..
[13] Mark S. Squillante,et al. Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.
[14] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..
[15] Rong Ge,et al. Power-Aware Speedup , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[16] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[17] Seetharami R. Seelam,et al. Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[18] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.
[19] Sarala Arunagiri,et al. Opportunistic Checkpoint Intervals to Improve System Performance , 2008 .
[20] Zhiling Lan,et al. Reliability-aware scalability models for high performance computing , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[21] Ravishankar K. Iyer,et al. Modeling coordinated checkpointing for large-scale supercomputers , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[22] Mukesh Singhal,et al. Checkpointing with mutable checkpoints , 2003, Theor. Comput. Sci..
[23] Dilma Da Silva,et al. Alleviating scalability issues of checkpointing protocols , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[24] Vipin Kumar,et al. Analysis of scalability of parallel algorithms and architectures: a survey , 1991, ICS '91.
[25] C. Murray Woodside,et al. Evaluating the scalability of distributed systems , 1998, Proceedings of the Thirty-First Hawaii International Conference on System Sciences.
[26] Peter H. Beckman,et al. Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System , 2009, 2009 International Conference on Parallel Processing Workshops.
[27] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[28] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[29] Zhiling Lan,et al. A fast restart mechanism for checkpoint/recovery protocols in networked environments , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[30] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[31] Zhiling Lan,et al. Filtering log data: Finding the needles in the Haystack , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[32] Sarah Ellen Michalak,et al. Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[33] Christian Engelmann,et al. Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.
[34] Ming Wu,et al. Performance under failures of high-end computing , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[35] Meeta Sharma Gupta,et al. Performance implications of periodic checkpointing on large-scale cluster systems , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[36] Thomas Hérault,et al. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[37] Xian-He Sun,et al. Optimizing HPC Fault-Tolerant Environment: An Analytical Approach , 2010, 2010 39th International Conference on Parallel Processing.
[38] Franck Cappello,et al. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[39] John W. Young. A first order approximation to the optimum checkpoint interval , 1974 .
[40] Fabrizio Petrini,et al. System-level fault-tolerance in large-scale parallel machines with buffered coscheduling , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[41] Chao Wang,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[42] Charng-Da Lu,et al. Big Systems and Big Reliability Challenges , 2003, PARCO.
[43] Stephen L. Scott,et al. Reliability of a System of k Nodes for High Performance Computing Applications , 2010, IEEE Transactions on Reliability.
[44] Robert Latham,et al. I/O performance challenges at leadership scale , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[45] Hsien-Hsin S. Lee,et al. Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era , 2008, Computer.
[46] John T. Daly,et al. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters , 2010, HPDC '10.
[47] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[48] Stephen L. Scott,et al. An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[49] Vivek Sarkar,et al. Software challenges in extreme scale systems , 2009 .
[50] Narayan Desai,et al. Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[51] Lionel M. Ni,et al. Another view on parallel speedup , 1990, Proceedings SUPERCOMPUTING '90.
[52] Henri Casanova,et al. Checkpointing strategies for parallel jobs , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[53] Larry Rudolph,et al. Cooperative checkpointing: a robust approach to large-scale systems reliability , 2006, ICS '06.
[54] Victor F. Nicola,et al. Checkpointing and the modeling of program execution time , 1994 .
[55] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).