Towards building performance models for data-intensive workloads in public clouds

The cloud computing paradigm provides the "illusion" of infinite resources and is therefore a promising candidate for large-scale data-intensive computing. In this paper, we explore experiment-driven performance models for data-intensive workloads executing in an infrastructure-as-a-service (IaaS) public cloud. The performance models help predict workload behaviour and serve as a key component of a larger framework for resource provisioning in the cloud. We determine a suitable prediction technique after comparing popular regression methods. We also enumerate the variables that contribute to variance in workload performance in a public cloud. Finally, we build a performance model for a multi-tenant data service in the Amazon cloud. We find that a linear classifier is sufficient in most cases. In a few cases, a linear classifier is unsuitable and non-linear modeling, which is time-consuming, is required. Consequently, we recommend training the performance model with a linear classifier first, and carrying out non-linear modeling only if the resulting model is unsatisfactory.
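The linear-first recommendation can be illustrated with a minimal sketch. It is not the procedure used in the paper: the single predictor, the R² threshold of 0.9, the k-nearest-neighbour fallback, and all function names are assumptions chosen only to keep the example self-contained.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=3):
    """Hypothetical non-linear fallback: k-nearest-neighbour regression."""
    points = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def r_squared(model, xs, ys):
    """Coefficient of determination of `model` on the given samples."""
    my = mean(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def train_performance_model(train, valid, threshold=0.9):
    """Try the cheap linear model first; fall back only if it validates poorly.

    `threshold` is an assumed acceptance criterion, not a value from the paper.
    """
    xs, ys = train
    vx, vy = valid
    linear = fit_linear(xs, ys)
    if r_squared(linear, vx, vy) >= threshold:
        return "linear", linear
    return "non-linear", fit_knn(xs, ys)


# Synthetic data for illustration only (not measurements from any cloud).
lin_train = ([1, 2, 3, 4, 5, 6], [3, 5, 7, 9, 11, 13])
lin_valid = ([7, 8, 9], [15, 17, 19])
kind_linear, model_linear = train_performance_model(lin_train, lin_valid)

curve_x = list(range(-5, 6))
curve_train = (curve_x, [x * x for x in curve_x])
curve_valid = ([-4.5, -2.5, 0.5, 2.5, 4.5], [20.25, 6.25, 0.25, 6.25, 20.25])
kind_curve, model_curve = train_performance_model(curve_train, curve_valid)
```

In this sketch the more expensive fallback is fitted only when the linear model fails validation, which mirrors the two-step order recommended above.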
