Runtime recovery actions selection for sporadic operations on public cloud

Sporadic operations such as rolling upgrade or machine instance redeployment are prone to unpredictable failures in the public cloud, largely because of the inherently high variability of public cloud environments. Previous dependability research has established several recovery methods for cloud failures. In this paper, we first propose eight recovery patterns for sporadic operations on public cloud. We then present a filtering process that identifies the applicable recovery patterns. We propose a mechanism that automatically generates recovery actions for the applicable patterns, based on our resource state transition algorithm. We also propose a methodology for evaluating the generated recovery actions against three recovery evaluation metrics: Recovery Time, Recovery Cost, and Recovery Impact. This quantitative evaluation leads to the selection of acceptable recovery actions. We propose two mechanisms for selecting recovery actions: one based on user constraints on the recovery evaluation metrics, and the other based on a Pareto set search algorithm. We implement a recovery service and illustrate its applicability by recovering from errors that occur during the rolling upgrade operation on the AWS cloud.
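The two selection mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `RecoveryAction` type, the action names, the metric values, and the assumption that lower is better for all three metrics are hypothetical, chosen only to show how constraint-based filtering and Pareto set search differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RecoveryAction:
    """Hypothetical candidate recovery action with its three evaluation metrics."""
    name: str
    time_s: float    # Recovery Time (assumed unit: seconds)
    cost_usd: float  # Recovery Cost (assumed unit: US dollars)
    impact: float    # Recovery Impact (assumed unitless score, lower is better)


def metrics(action: RecoveryAction) -> Tuple[float, float, float]:
    return (action.time_s, action.cost_usd, action.impact)


def within_constraints(actions: List[RecoveryAction], max_time_s: float,
                       max_cost_usd: float, max_impact: float) -> List[RecoveryAction]:
    """Mechanism 1: keep actions that satisfy every user-supplied upper bound."""
    return [a for a in actions
            if a.time_s <= max_time_s
            and a.cost_usd <= max_cost_usd
            and a.impact <= max_impact]


def dominates(a: RecoveryAction, b: RecoveryAction) -> bool:
    """a dominates b if a is no worse on every metric and strictly better on one."""
    ma, mb = metrics(a), metrics(b)
    return (all(x <= y for x, y in zip(ma, mb))
            and any(x < y for x, y in zip(ma, mb)))


def pareto_set(actions: List[RecoveryAction]) -> List[RecoveryAction]:
    """Mechanism 2: keep actions that no other candidate dominates."""
    return [a for a in actions
            if not any(dominates(b, a) for b in actions)]


# Hypothetical candidates; names and numbers are illustrative only.
candidates = [
    RecoveryAction("action_a", 120.0, 0.10, 0.2),
    RecoveryAction("action_b", 60.0, 0.05, 0.1),
    RecoveryAction("action_c", 45.0, 0.30, 0.1),
    RecoveryAction("action_d", 200.0, 0.40, 0.5),
]

constrained = within_constraints(candidates, max_time_s=100.0,
                                 max_cost_usd=0.20, max_impact=0.3)
front = pareto_set(candidates)
```

With these values, the constraint-based mechanism returns only `action_b`, while the Pareto set contains `action_b` and `action_c`, since neither is better than the other on all three metrics. The constraint-based approach needs explicit bounds from the user; the Pareto approach needs none but may leave several candidates to choose from.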
