Optimizing Job Reliability Through Contention-Free, Distributed Checkpoint Scheduling

A datacenter with hundreds or thousands of servers can provide virtualized environments to a large number of cloud applications and jobs whose reliability requirements differ widely. Checkpointing a virtual machine (VM) is a proven technique for improving reliability. However, existing checkpoint scheduling techniques for distributed systems fail to achieve satisfactory results, either because they offer the same fixed reliability to all jobs, or because they are tied to specific applications and rely on centralized checkpoint control. In this work, we first show that reliability can be significantly improved through contention-free scheduling of checkpoints. Then, inspired by the Carrier Sense Multiple Access (CSMA) protocol used for wireless congestion control, we propose a novel framework for distributed, contention-free scheduling of VM checkpoints that provides reliability as a transparent, elastic service. We quantify reliability in closed form by studying the system's stationary behaviour, and maximize job reliability through utility optimization. Our design is validated with a proof-of-concept prototype that leverages readily available implementations in Xen hypervisors. The proposed checkpoint scheduling is shown to significantly reduce checkpointing interference and to improve reliability by as much as one order of magnitude over contention-oblivious checkpoint schemes.
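
To make the CSMA analogy concrete, the following is a minimal sketch of carrier-sense checkpoint scheduling, not the algorithm or prototype described in this work. It assumes a hypothetical conflict graph (VMs that share a host or storage path contend with each other), fixed per-VM backoff rates, and a constant checkpoint duration; the function name and all parameters are illustrative. In the proposed framework the rates would instead be tuned by utility optimization, which this sketch omits.

```python
import heapq
import random


def simulate_csma_checkpoints(conflicts, rates, ckpt_time, horizon, seed=0):
    """Return a list of (start, end, vm) checkpoint intervals.

    conflicts: dict vm -> set of VMs that contend with it (assumed symmetric).
    rates:     dict vm -> backoff rate (assumed fixed here); a larger rate
               means the VM attempts to checkpoint more often.
    ckpt_time: duration of one checkpoint (assumed constant and positive).
    horizon:   simulated time after which no new attempt is made.
    """
    rng = random.Random(seed)
    # End time of the most recent checkpoint taken by each VM.
    busy_until = {vm: 0.0 for vm in conflicts}
    # Event queue of (time, vm): the moment a VM's backoff timer expires.
    events = [(rng.expovariate(rates[vm]), vm) for vm in conflicts]
    heapq.heapify(events)
    intervals = []
    while events:
        now, vm = heapq.heappop(events)
        if now >= horizon:
            break
        # Carrier sense: is any contending VM still checkpointing?
        channel_busy = any(busy_until[n] > now for n in conflicts[vm])
        if channel_busy:
            # Defer and draw a fresh backoff instead of colliding.
            next_attempt = now + rng.expovariate(rates[vm])
        else:
            busy_until[vm] = now + ckpt_time
            intervals.append((now, busy_until[vm], vm))
            next_attempt = busy_until[vm] + rng.expovariate(rates[vm])
        heapq.heappush(events, (next_attempt, vm))
    return intervals


if __name__ == "__main__":
    # Hypothetical topology: "b" contends with both "a" and "c".
    conflicts = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    rates = {"a": 1.0, "b": 1.0, "c": 1.0}
    for start, end, vm in simulate_csma_checkpoints(conflicts, rates, 0.5, 5.0):
        print(f"{vm}: {start:.2f} -> {end:.2f}")
```

Because a VM starts a checkpoint only when none of its contending neighbours is checkpointing, the resulting intervals never overlap between neighbours, and no central coordinator is needed. Raising a VM's rate makes it checkpoint more often, which is the knob a utility-based scheme could adjust per job.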
