Multi-objective Optimisation of Online Distributed Software Update for DevOps in Clouds

This article studies synchronous online distributed software update, known in DevOps as rolling upgrade, which upgrades the software version running on virtual machine instances in clouds, even when various failures may occur. The goal is to minimise the completion time, availability degradation, and monetary cost of the entire rolling upgrade by selecting proper parameters. To this end, we propose a stochastic model and a novel optimisation method. We validate our approach to minimising these objectives through both experiments on Amazon Web Services (AWS) and simulations.
