Paper information - Towards building performance models for data-intensive workloads in public clouds

The cloud computing paradigm provides the "illusion" of infinite resources and is therefore a promising candidate for large-scale data-intensive computing. In this paper, we explore experiment-driven performance models for data-intensive workloads executing in an infrastructure-as-a-service (IaaS) public cloud. The performance models help predict workload behaviour and serve as a key component of a larger framework for resource provisioning in the cloud. We determine a suitable prediction technique after comparing popular regression methods. We also enumerate the variables that affect the variance of workload performance in a public cloud. Finally, we build a performance model for a multi-tenant data service in the Amazon cloud. We find that a linear classifier is sufficient in most cases. On a few occasions, a linear classifier is unsuitable and non-linear modeling, which is time-consuming, is required. Consequently, we recommend training the performance model with a linear classifier in the first instance. If the resulting model is unsatisfactory, non-linear modeling can be carried out as the next step.
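The "linear first, non-linear only if needed" recommendation can be sketched as a small model-selection routine. This is a minimal illustration and not the authors' implementation: the single input feature, the R² acceptance threshold of 0.9, and the use of k-nearest-neighbour regression as the non-linear stand-in are all assumptions made here for brevity, and the names `fit_linear`, `fit_knn`, and `build_model` are hypothetical.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression (assumed stand-in for a non-linear model)."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def r_squared(model, xs, ys):
    """Coefficient of determination of `model` on the given samples."""
    my = mean(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def build_model(train, valid, threshold=0.9):
    """Train a linear model first; fall back to the non-linear one only
    if the linear fit scores below `threshold` (an assumed cut-off) on
    the validation samples. `train` and `valid` are (xs, ys) tuples."""
    linear = fit_linear(*train)
    if r_squared(linear, *valid) >= threshold:
        return "linear", linear
    return "non-linear", fit_knn(*train)
```

For example, `build_model((xs, ys), (vx, vy))` returns the tag `"linear"` together with the fitted predictor when the measurements follow a straight line, and only pays the cost of the non-linear fit when the linear one fails the validation check.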
