Keep it moving: Proactive workload management for reducing SLA violations in large scale SaaS clouds

Software failures, workload-related failures, and job overload conditions cause SLA violations in software-as-a-service (SaaS) systems. Existing work does not fully address the mitigation of these violations, for three reasons: (i) no existing approach addresses mitigation in business-specific scenarios (SaaS, in our case); (ii) some approaches do not address software and workload-related failures, while others do not treat the selection of a target physical machine (PM) for workload migration comprehensively, leaving out vital considerations such as workload compatibility checks between the migrating VM and the VMs already on the target PM; and (iii) a clear mathematical mapping between workload, resource demand, and SLA is lacking. In this paper, we present Keep It Moving (KIM), a software framework for the cloud controller that helps minimize service failures due to violations of availability, utilization, and response-time SLAs in SaaS cloud data centers. Although migration is our primary mitigation technique, we also try to mitigate SLA violations without it. Before migrating, we perform a capacity check on the host PM to determine whether it has enough capacity to address the upcoming SLA violations through restart/reboot or VM resizing. In certain cases, such as workload-related failures due to corrupt files, we prefer rerouting the workload to a replica VM over migration. We formulate the selection of a target PM as a multi-objective optimization problem. We validate the approach using a trace-based discrete-event simulation of a virtualized data center in which failure and workload characteristics are simulated from data extracted from the server logs of a real SaaS business. We found that our approach can reduce SLA violations by 60% and VM downtime by approximately 10%.
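The mitigation order described in the abstract can be illustrated with a minimal Python sketch. It is not the KIM implementation: the names `Action`, `PredictedViolation`, and `choose_mitigation`, the cause labels, the single scalar capacity measure, and the order in which the checks are applied are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Mitigation options named in the abstract."""
    REROUTE_TO_REPLICA = "reroute_to_replica"
    RESTART_OR_RESIZE = "restart_or_resize"
    MIGRATE = "migrate"


@dataclass
class PredictedViolation:
    # Hypothetical cause labels: "software", "corrupt_file", "overload".
    cause: str
    # Extra capacity the VM is expected to need (arbitrary units; assumption).
    extra_demand: float
    # Whether a replica VM exists to take over the workload (assumption).
    has_replica: bool


def choose_mitigation(violation: PredictedViolation,
                      host_free_capacity: float) -> Action:
    """Pick a mitigation for a predicted SLA violation.

    Assumed order: prefer rerouting for corrupt-file failures, then try to
    stay on the current PM if it has enough spare capacity, else migrate.
    """
    if violation.cause == "corrupt_file" and violation.has_replica:
        return Action.REROUTE_TO_REPLICA
    if host_free_capacity >= violation.extra_demand:
        # Capacity check passed: restart/reboot or resize on the host PM.
        return Action.RESTART_OR_RESIZE
    return Action.MIGRATE


if __name__ == "__main__":
    overload = PredictedViolation("overload", extra_demand=4.0, has_replica=False)
    print(choose_mitigation(overload, host_free_capacity=2.0).value)
```

In this sketch, migration is reached only when neither rerouting nor an in-place fix applies; choosing the target PM would then be a separate multi-objective optimization step, which is not modeled here.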
