Resilient Techniques Against Disruptions of Volatile Cloud Resources

On-Demand cloud resources are highly available and reliable because most major cloud service providers organize their clouds as a network of several regions (data centres), each with multiple availability zones. This redundant and highly distributed resource pool offers users high availability and reliability, even in the case of disasters. To increase revenue, cloud service providers offer their unused computing capacity at much lower prices than On-Demand resources, in the form of volatile cloud resources. The trade-off for the steep discount is their volatility, i.e. lower availability and lower reliability: a user can lose some or all volatile resources at any time, much as in a large-scale technology-related failure (disaster). This chapter introduces volatile cloud resources, their life cycle, and their pros and cons. It also presents several techniques for resilience against disruptions and multiple failures of volatile cloud resources.
