Orchestrating Fault Prediction with Live Migration and Checkpointing
[1] Yves Robert,et al. Towards Optimal Multi-Level Checkpointing , 2017, IEEE Transactions on Computers.
[2] Chao Wang,et al. Proactive process-level live migration and back migration in HPC environments , 2012, J. Parallel Distributed Comput.
[3] Misbah Mubarak,et al. Evaluating Burst Buffer Placement in HPC Systems , 2019, 2019 IEEE International Conference on Cluster Computing (CLUSTER).
[4] Andrew A. Chien,et al. How Much SSD Is Useful for Resilience in Supercomputers , 2015, FTXS@HPDC.
[5] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[6] Dong H. Ahn,et al. Flux: Overcoming Scheduling Challenges for Exascale Workflows , 2018, 2018 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS).
[7] Scott Klasky,et al. Comprehensive Measurement and Analysis of the User-Perceived I/O Performance in a Production Leadership-Class Storage System , 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).
[8] Frank Mueller,et al. Desh: deep learning for system health prediction of lead times to failure in HPC , 2018, HPDC.
[9] Surendra Byna,et al. Accelerating Science with the NERSC Burst Buffer Early User Program , 2016 .
[10] Franck Cappello,et al. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[11] Scott B. Baden,et al. Doomsday: Predicting Which Node Will Fail When on Supercomputers , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[13] Satoshi Matsuoka,et al. Design and modeling of a non-blocking checkpointing system , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[14] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Robert B. Ross,et al. On the role of burst buffers in leadership-class storage systems , 2012, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).
[16] Andy B. Yoo,et al. X-ray Pulse Compression Using Strained Crystals , 2002 .
[17] Franck Cappello,et al. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[18] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[19] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[20] Franck Cappello,et al. Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model , 2017, IEEE Transactions on Parallel and Distributed Systems.
[21] Saurabh Gupta,et al. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[22] Tirthak Patel,et al. Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[23] Sathish S. Vadhiyar,et al. Fault Tolerance on Large Scale Systems using Adaptive Process Replication , 2015, IEEE Transactions on Computers.
[24] Lipeng Wan,et al. Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems , 2017, J. Parallel Distributed Comput.
[25] Christian Engelmann,et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors , 2002 .
[26] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[27] Bronis R. de Supinski,et al. The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[28] Chao Wang,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[29] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[30] Stephen L. Scott,et al. An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[31] Satoshi Matsuoka,et al. A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers , 2014, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[32] Kamil Iskra,et al. ZOID: I/O-forwarding infrastructure for petascale architectures , 2008, PPoPP.
[33] John Shalf,et al. DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges , 2014 .