Optimizing Job Reliability Through Contention-Free, Distributed Checkpoint Scheduling
Suresh Subramaniam | Tian Lan | Hang Liu | Howie Huang | Yu Xiang
[1] Jordi Torres,et al. Checkpoint-based fault-tolerant infrastructure for virtualized service providers , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.
[2] R. Badrinath,et al. Virtualization aware job schedulers for checkpoint-restart , 2007, 2007 International Conference on Parallel and Distributed Systems.
[3] Thomas Hérault,et al. Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[4] M. Wiboonrat. An Empirical Study on Data Center System Failure Diagnosis , 2008, 2008 The Third International Conference on Internet Monitoring and Protection.
[5] Tadashi Dohi,et al. A dynamic checkpointing scheme based on reinforcement learning , 2004, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.
[6] Jean C. Walrand,et al. A Distributed CSMA Algorithm for Throughput and Utility Maximization in Wireless Networks , 2010, IEEE/ACM Transactions on Networking.
[7] Hai Jin,et al. Optimize Performance of Virtual Machine Checkpointing via Memory Exclusion , 2009, 2009 Fourth ChinaGrid Annual Conference.
[8] Peter Norvig,et al. Artificial intelligence - a modern approach, 2nd Edition , 2003, Prentice Hall series in artificial intelligence.
[9] Muli Ben-Yehuda,et al. Virtual machine time travel using continuous data protection and checkpointing , 2008, OPSR.
[10] Daniel Sun,et al. Reliability and energy efficiency in cloud computing systems: Survey and taxonomy , 2016, J. Netw. Comput. Appl.
[11] Irene Zhang,et al. Optimizing VM Checkpointing for Restore Performance in VMware ESXi , 2013, USENIX Annual Technical Conference.
[12] Ion Stoica,et al. Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills , 2011 .
[13] Stephen L. Scott,et al. An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[14] Andrew Warfield,et al. Parallax: Managing Storage for a Million Machines , 2005, HotOS.
[15] Andrzej Duda,et al. The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett.
[16] Hai Jin,et al. VirtCFT: A Transparent VM-Level Fault-Tolerant System for Virtual Clusters , 2010, 2010 IEEE 16th International Conference on Parallel and Distributed Systems.
[17] A. Robert Calderbank,et al. Layering as Optimization Decomposition: A Mathematical Theory of Network Architectures , 2007, Proceedings of the IEEE.
[18] Mung Chiang,et al. An Axiomatic Theory of Fairness in Resource Allocation , 2010 .
[19] Tadashi Dohi,et al. Distribution-free checkpoint placement algorithms based on min-max principle , 2006, IEEE Transactions on Dependable and Secure Computing.
[20] Douglas M. Blough,et al. Fast, Lightweight Virtual Machine Checkpointing , 2010 .
[21] Mung Chiang,et al. Stability and Benefits of Suboptimal Utility Maximization , 2011, IEEE/ACM Transactions on Networking.
[22] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.
[23] Tadashi Dohi,et al. Optimal Checkpoint Placement with Equality Constraints , 2006, 2006 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing.
[24] Nitin H. Vaidya,et al. Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme , 1997, IEEE Trans. Computers.
[25] C. D. Gelatt,et al. Optimization by Simulated Annealing , 1983, Science.
[26] Mung Chiang,et al. Multiresource Allocation: Fairness–Efficiency Tradeoffs in a Unifying Framework , 2012, IEEE/ACM Transactions on Networking.
[27] Jack J. Dongarra,et al. The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp.
[28] Koushik Kar,et al. Throughput modelling and fairness issues in CSMA/CA based ad-hoc networks , 2005, Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies.
[29] Soung Chang Liew,et al. Back-of-the-Envelope Computation of Throughput Distributions in CSMA Wireless Networks , 2007, 2009 IEEE International Conference on Communications.
[30] Tadashi Dohi,et al. Availability models with age-dependent checkpointing , 2002, 21st IEEE Symposium on Reliable Distributed Systems, 2002. Proceedings.
[31] H. Howie Huang,et al. Providing reliability as an elastic service in cloud computing , 2012, 2012 IEEE International Conference on Communications (ICC).
[32] Sheldon M. Ross,et al. Introduction to Probability Models (4th ed.). , 1990 .
[33] Franck Cappello,et al. BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[34] Tadashi Dohi,et al. Bayesian perspective of optimal checkpoint placement , 2005, Ninth IEEE International Symposium on High-Assurance Systems Engineering (HASE'05).
[35] K. Mani Chandy,et al. A Survey of Analytic Models of Rollback and Recovery Strategies , 1975, Computer.
[36] Tao Ke,et al. Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).
[37] Harshpreet Singh,et al. Review on Fault Tolerance Techniques in Cloud Computing , 2015 .
[38] Ulas C. Kozat,et al. In-network live snapshot service for recovering virtual infrastructures , 2011, IEEE Network.