Optimizing Backup Resources in the Cloud

Cloud computing promises high performance and cost-efficiency. However, most cloud infrastructures operate at low utilization, which greatly hinders cost-effectiveness. Previous work focuses on efficient virtual machine (VM) consolidation strategies that increase the utilization of virtual resources in production environments, while overlooking the under-utilization of backup virtual resources. We propose a heuristic time-sharing policy derived from the restless multi-armed bandit problem. The proposed policy increases the utilization of backup virtual resources while still providing high availability. The experimental results show that the traditional 1:1 backup provision can be extended to 1:M (M>>1) between the backup VM and the service VMs, and that the utilization of backup VMs can be improved significantly.
