Capelin: Data-Driven Compute Capacity Procurement for Cloud Datacenters Using Portfolios of Scenarios

Cloud datacenters provide a backbone to our digital society. Inaccurate capacity procurement for cloud datacenters can lead to significant performance degradation, denser targets for failure, and unsustainable energy consumption. Although this activity is core to improving cloud infrastructure, relatively few comprehensive approaches and support tools exist for mid-tier operators, leaving many planners with merely rule-of-thumb judgement. We derive requirements from a unique survey of experts in charge of diverse datacenters in several countries. We propose Capelin, a data-driven, scenario-based capacity planning system for mid-tier cloud datacenters. Capelin introduces the notion of portfolios of scenarios, which it leverages when probing for alternative capacity plans. At the core of the system, a trace-based, discrete-event simulator enables the exploration of different possible topologies, with support for scaling the volume, variety, and velocity of resources, and for horizontal (scale-out) and vertical (scale-up) scaling. Capelin compares alternative topologies and, for each, gives detailed quantitative operational information, which could facilitate human decisions in capacity planning. We implement and open-source Capelin, and show through comprehensive trace-based experiments that it can aid practitioners. The results give evidence that reasonable choices can be worse than the best by a factor of 1.5-2.0, in terms of performance degradation or energy consumption.
