Cloud application deployment with transient failure recovery

Application deployment is a crucial operation for modern cloud providers. The ability to dynamically allocate resources and deploy a new application instance from a user-provided description, in a fully automated manner, matters greatly to cloud users because it lets them generate fully reproducible application environments with minimal effort. However, most modern deployment solutions do not consider the error-prone nature of the cloud: network glitches, poor synchronization between services, and other software- or infrastructure-related failures of a transient nature are frequently encountered. Although such failures may be tolerable during an application’s lifetime, during the deployment phase they can cause severe errors and make the deployment fail. To tackle this challenge, we propose AURA, an open source system that enables cloud application deployment with transient failure recovery capabilities. AURA formulates the application deployment as a Directed Acyclic Graph. Whenever a transient failure occurs, it traverses the graph, identifies the parts that failed, and re-executes the respective scripts, on the premise that once the transient failure disappears the script execution will succeed. Moreover, to guarantee that each script execution is idempotent, AURA adopts a lightweight filesystem snapshot mechanism that aims to cancel the side effects of the failed scripts. Our thorough evaluation indicates that AURA can deploy diverse real-world applications to environments exhibiting high error probabilities, introducing a minimal time overhead that is proportional to the failure probability of the deployment scripts.
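The recovery idea described above can be illustrated with a minimal Python sketch. It is not AURA's implementation: the names `deploy`, `make_script`, and `TransientError` are hypothetical, an in-memory dictionary stands in for the filesystem, a deep copy stands in for the filesystem snapshot, and scripts run one at a time in topological order rather than through AURA's own graph traversal.

```python
import copy
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class TransientError(Exception):
    """Hypothetical stand-in for a failure that may vanish on a later attempt."""


def deploy(graph, scripts, state, max_retries=5):
    """Run every script in dependency order, retrying the ones that fail.

    graph   -- maps each script name to the set of names it depends on
    scripts -- maps each script name to a callable that mutates `state`
    state   -- dict standing in for the filesystem of the target machine
    Returns the number of attempts each script needed.
    """
    attempts = {}
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_retries + 1):
            # "Snapshot" taken before the script runs (assumption: a deep copy
            # plays the role of the lightweight filesystem snapshot).
            snapshot = copy.deepcopy(state)
            try:
                scripts[name](state)
                attempts[name] = attempt
                break
            except TransientError:
                # Roll back partial side effects so the retry starts clean.
                state.clear()
                state.update(snapshot)
        else:
            raise RuntimeError(f"{name} still failing after {max_retries} attempts")
    return attempts


def make_script(key, failures):
    """Build a script that leaves a side effect, then fails `failures` times."""
    remaining = {"n": failures}

    def script(state):
        state.setdefault("log", []).append(f"{key}:started")  # partial side effect
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientError(key)
        state[key] = "installed"

    return script


# Toy deployment: "web" depends on "db"; "db" fails twice before succeeding.
graph = {"db": set(), "web": {"db"}}
scripts = {"db": make_script("db", 2), "web": make_script("web", 0)}
state = {}
attempts = deploy(graph, scripts, state)
```

In this toy run the rollback removes the log entries left behind by the two failed `db` attempts, so the final state looks as if every script had succeeded on its first try. That is the sense in which restoring a snapshot makes re-execution behave idempotently.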
