Dynamic monitoring, modeling and management of performance and resources for applications in the cloud

Emerging trends in Cloud computing bring numerous benefits, such as higher performance, fast and flexible provisioning of applications and capacity, lower infrastructure costs, and almost unlimited scalability. However, the increasing complexity of automated performance and resource management for Cloud applications presents novel challenges that demand enhancements to classical control-based approaches. An important challenge that Cloud service providers often face is a resource sharing dilemma under workload variation. Providers pursue higher resource utilization because the higher the utilization, the lower the hardware, operating, and maintenance costs. On the other hand, resource utilization cannot be too high, or the provider's revenue may be jeopardized by the inability to meet application-level service-level objectives (SLOs). A crucial research question is therefore how Cloud service providers can maximize profit: generating as much revenue as possible by satisfying service-level agreements (SLAs) while reducing costs as much as possible. Classical control-based approaches, which can be classified into three major categories (admission control, queueing and scheduling, and resource allocation), show great potential for addressing this dilemma. However, it is challenging to apply them directly to computer systems, where first-principle models are generally not available. The task becomes even more difficult because of the dynamics of real computer systems, including workload variations, multi-tier dependencies, and resource bottleneck shifts.
The main contribution of this thesis is to enhance classical control-based approaches with complementary techniques in order to address the increasing complexity of automated performance and resource management in the Cloud through dynamic monitoring, modeling, and management of performance and resources. More specifically, (1) an admission control approach is enhanced by leveraging decision theory to achieve the most profitable service-level compliance; (2) a critical resource identification approach is enhanced by leveraging statistical machine learning to identify critical resources automatically and adaptively; and (3) a resource allocation approach is enhanced by leveraging hierarchical resource management to achieve the highest resource utilization. The enhanced approaches are implemented in a collection of real control systems: ActiveSLA, vPerfGuard and ERController. These systems are applied to different real applications, such as OLTP and OLAP database applications and distributed multi-tier web applications, with different workload intensities, types, and mixes, in different Cloud environments. All experimental results show that the prototype control systems outperform existing classical control-based approaches. Finally, this thesis opens new avenues for addressing the increasing complexity of automated performance and resource management by enhancing classical control-based approaches in Cloud environments. Future work will follow these avenues to address the challenges that arise with new hardware technology, new software frameworks, and new computing paradigms.
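As a rough illustration of what decision-theoretic, profit-oriented admission control can look like, the sketch below admits a request only when the expected profit of running it exceeds the payoff of rejecting it. This is not the ActiveSLA implementation: the function names, the single-step revenue/penalty SLA, and the assumption that a probability of meeting the SLO is already available from some prediction model are all hypothetical simplifications.

```python
def expected_profit(p_meet_slo: float, revenue: float, penalty: float) -> float:
    """Expected profit of admitting one request (hypothetical step-function SLA).

    The provider earns `revenue` if the SLO is met and pays `penalty` otherwise.
    `p_meet_slo` is assumed to come from some external prediction model.
    """
    return p_meet_slo * revenue - (1.0 - p_meet_slo) * penalty


def should_admit(p_meet_slo: float, revenue: float, penalty: float,
                 reject_cost: float = 0.0) -> bool:
    """Admit when admitting has a higher expected payoff than rejecting.

    Rejecting is assumed to cost `reject_cost` (so its payoff is -reject_cost).
    """
    return expected_profit(p_meet_slo, revenue, penalty) > -reject_cost


if __name__ == "__main__":
    # A request likely to meet its SLO versus one likely to miss it.
    print(should_admit(0.9, revenue=1.0, penalty=2.0))
    print(should_admit(0.3, revenue=1.0, penalty=2.0))
```

Under these assumptions, a request that is likely to miss its SLO is turned away unless rejecting it is itself costly enough, which is the trade-off between revenue and SLA penalties described above.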
