Diagnosing, predicting and managing application performance in virtualised multi-tenant clouds

As the computing industry enters the cloud era, multicore architectures and virtualisation technologies are replacing traditional IT infrastructures for several reasons, including reduced infrastructure costs, lower energy consumption and ease of management. Cloud-based software systems are expected to deliver reliable performance under dynamic workloads while allocating resources efficiently. However, as these environments grow more diverse and sophisticated, managing the performance of the applications that run in them becomes difficult. The primary goal of this thesis is to gain insight into the performance issues of applications running in clouds. This is achieved through a number of innovations in the monitoring, modelling and managing of virtualised computing systems:

(i) Monitoring: we develop a monitoring and resource control platform that, unlike early cloud benchmarking systems, enables service level objectives (SLOs) to be expressed graphically as Performance Trees, which draw on both live and historical data.

(ii) Modelling: we develop stochastic models based on Queueing Networks and Markov chains for predicting the performance of applications in multicore virtualised computing systems. The key feature of our techniques is their ability to characterise performance bottlenecks effectively by modelling both the hypervisor and the hardware.

(iii) Managing: by integrating our benchmarking and modelling techniques with a novel interference-aware prediction model, adaptive on-line reconfiguration and resource control in virtualised environments become lightweight, target-specific operations that do not require sophisticated pre-training or micro-benchmarking.

The validation results show that our models can predict the expected scalability behaviour of CPU-intensive and network-intensive applications running in virtualised multicore environments with relative errors between 8% and 26%. We also show that our performance interference prediction model can capture a broad range of workloads efficiently, achieving an average error of 9% across different applications and setups. We implement this model in a private cloud deployment in our department and evaluate it using both synthetic benchmarks and real user applications. We also explore the applicability of our model to both hypervisor reconfiguration and resource scheduling. Hypervisor reconfiguration can improve network throughput by up to 30%, while the interference-aware scheduler improves application performance by up to 10% compared with the default CloudStack scheduler.
