MULTS: A multi-cloud fault-tolerant architecture to manage transient servers in cloud computing

Abstract The large-scale use of cloud computing resources has made the reliability of cloud environments an important issue. In addition, cloud providers now sell unreliable virtual machines by exploiting their unused resources and offering them as transient servers: a lower-priced virtual machine service whose resources can be revoked without user intervention. To increase the availability of transient servers, we propose a multi-cloud fault-tolerant architecture that provides a resilient environment, using a scenario-based optimal checkpoint scheme to keep processes running at reduced cost to the user. The architecture combines a heuristic that extracts information from case-based reasoning with a statistical model that predicts failure events, which helps refine the fault tolerance parameters. The result is a cloud environment with higher reliability and reduced execution time. Extensive simulations show high accuracy, with a survival prediction success rate of up to 92% and an execution time reduction of 74.58% for long-running applications. The results are promising, indicating that the proposed architecture can prevent revocation failures under realistic working conditions.
