Virtual Redundancy for Active-Standby Cloud Applications

VM redundancy is the foundation of resilient cloud applications. While active-active approaches combined with load balancing and autoscaling are usually resource efficient, the stateful nature of many cloud applications often necessitates 1+1 (or $1+n$) active-standby approaches. Keeping standbys, however, can leave cloud resources underutilized. We explore a cloud-based solution in which standby VMs from active-standby applications are selectively overbooked to reduce the resources reserved for failures. The approach requires careful VM placement to avoid a situation where multiple standby VMs activate simultaneously on the same host and thus cannot receive their full resource entitlement. Today's clouds, however, lack this visibility into applications. We address this gap with ShadowBox, a novel redundancy-aware VM scheduler that optimizes the placement and activation of standby VMs while assuring applications' resource entitlements. Evaluation on a large-scale cloud shows that ShadowBox can significantly improve resource utilization (by more than 2.5 times compared with traditional approaches) while minimizing the impact on applications' entitlements.
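The placement constraint described above can be illustrated with a small sketch. This is not ShadowBox's algorithm, which the abstract does not specify. It is a hypothetical greedy placer written under simplifying assumptions: at most one host fails at a time, every VM has the same size, and each host reserves a fixed number of standby slots. The names `place_standbys`, `slots_per_host`, and `overbook` are invented for illustration.

```python
from collections import defaultdict


def place_standbys(active_host_of, hosts, slots_per_host, overbook):
    """Greedily place one standby VM per application (illustrative only).

    active_host_of: dict mapping application name -> host of its active VM.
    hosts:          list of host names that may hold standbys.
    slots_per_host: standby slots actually reserved on each host (assumed).
    overbook:       a host may accept slots_per_host * overbook standbys.

    Assumed failure model: a single host fails at a time. A host is safe
    if, for every possible failed host, the standbys that would activate
    on it fit within slots_per_host.
    """
    placement = {}
    # by_origin[h][f]: standbys on host h whose active VM runs on host f.
    by_origin = {h: defaultdict(int) for h in hosts}
    total = {h: 0 for h in hosts}

    for app, active_host in sorted(active_host_of.items()):
        for h in hosts:
            if h == active_host:
                continue  # standby must not share a host with its active
            if total[h] >= slots_per_host * overbook:
                continue  # overbooking limit reached on this host
            if by_origin[h][active_host] >= slots_per_host:
                continue  # a failure of active_host would oversubscribe h
            placement[app] = h
            by_origin[h][active_host] += 1
            total[h] += 1
            break
        else:
            raise ValueError(f"no feasible host for standby of {app!r}")
    return placement


if __name__ == "__main__":
    actives = {"a": "h1", "b": "h1", "c": "h2", "d": "h3"}
    print(place_standbys(actives, ["h1", "h2", "h3"],
                         slots_per_host=1, overbook=2))
```

In this toy run, host `h1` holds two standbys against a single reserved slot, which is safe under the assumed failure model because their active VMs sit on different hosts and so would not activate together. The standbys of `a` and `b`, whose actives share `h1`, are kept on separate hosts.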
