Service Clustering for Autonomic Clouds Using Random Forest

Managing and optimising cloud services is one of the main challenges faced by industry and academia. A possible solution is self-management, as fostered by autonomic computing. However, the abstraction layer provided by cloud computing obfuscates several details of the provided services, which in turn hinders the effectiveness of autonomic managers. Data-driven approaches, particularly those relying on service clustering based on machine learning techniques, can assist autonomic management and support decisions concerning, for example, the scheduling and deployment of services. One complication is that the information provided by monitoring contains both continuous (e.g. CPU load) and categorical (e.g. VM instance type) data. Current approaches treat this problem heuristically. This paper instead proposes an approach that uses all kinds of data and learns the similarities and resource usage patterns among services in a data-driven fashion. In particular, we use an unsupervised formulation of the Random Forest algorithm to calculate similarities and provide them as input to a clustering algorithm. For the sake of efficiency, and to meet the dynamism requirement of autonomic clouds, our methodology consists of two steps: (i) off-line clustering and (ii) on-line prediction. Using datasets from real-world clouds, we demonstrate the superiority of our solution over other approaches and validate the accuracy of the on-line prediction. Moreover, to show the applicability of our approach, we devise a service scheduler that uses the notion of similarity among services and evaluate it in a cloud test-bed.
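The two-step methodology can be illustrated with a minimal sketch, which is not the implementation evaluated in the paper. It assumes one common unsupervised Random Forest formulation: trees are trained to separate the observed services from synthetic ones obtained by sampling each attribute independently, and the similarity of two services is the fraction of trees in which they reach the same leaf. The k-medoids step, the nearest-medoid rule used for on-line prediction, the toy data and all function names are illustrative assumptions.

```python
import random

def goes_left(row, feature, value):
    """Split test: equality for categorical (str) values, threshold for numeric ones."""
    if isinstance(value, str):
        return row[feature] == value
    return row[feature] <= value

def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow(rows, labels, rng):
    """Grow one tree on real (1) vs synthetic (0) rows; a leaf is a unique object()."""
    if len(set(labels)) < 2:
        return object()
    n_features = len(rows[0])
    best = None
    for f in rng.sample(range(n_features), max(1, int(n_features ** 0.5))):
        for v in sorted(set(r[f] for r in rows)):
            mask = [goes_left(r, f, v) for r in rows]
            if all(mask) or not any(mask):
                continue  # this candidate separates nothing
            left = [y for y, m in zip(labels, mask) if m]
            right = [y for y, m in zip(labels, mask) if not m]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, v, mask)
    if best is None:
        return object()  # sampled features cannot split this node
    _, f, v, mask = best
    l_rows = [r for r, m in zip(rows, mask) if m]
    l_labels = [y for y, m in zip(labels, mask) if m]
    r_rows = [r for r, m in zip(rows, mask) if not m]
    r_labels = [y for y, m in zip(labels, mask) if not m]
    return (f, v, grow(l_rows, l_labels, rng), grow(r_rows, r_labels, rng))

def synthetic_like(rows, rng):
    """Sample every column independently from its observed values."""
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in rows]

def fit_forest(rows, n_trees=100, seed=0):
    """Off-line step, part 1: train trees to tell real rows from synthetic ones."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        data = [(r, 1) for r in rows] + [(s, 0) for s in synthetic_like(rows, rng)]
        boot = [rng.choice(data) for _ in data]  # bootstrap sample
        forest.append(grow([r for r, _ in boot], [y for _, y in boot], rng))
    return forest

def leaf_of(tree, row):
    """Route a row down a tree and return the leaf object it reaches."""
    while isinstance(tree, tuple):
        f, v, left, right = tree
        tree = left if goes_left(row, f, v) else right
    return tree

def similarity(forest, a, b):
    """Fraction of trees in which rows a and b end up in the same leaf."""
    return sum(leaf_of(t, a) is leaf_of(t, b) for t in forest) / len(forest)

def nearest(dist, medoids, i):
    """Index of the cluster whose medoid is closest to point i."""
    return min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])

def k_medoids(dist, k, n_iter=20):
    """Off-line step, part 2: simple k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:  # farthest-first initialisation
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(n_iter):
        labels = [nearest(dist, medoids, i) for i in range(n)]
        new = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
            new.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new == medoids:
            break
        medoids = new
    return medoids, [nearest(dist, medoids, i) for i in range(n)]

def predict(forest, rows, medoids, new_row):
    """On-line step: assign a new service to the cluster of its most similar medoid."""
    return max(range(len(medoids)), key=lambda c: similarity(forest, rows[medoids[c]], new_row))

# Toy monitoring data (made up): (CPU load, memory load, VM instance type).
services = [(0.08, 0.18, "small"), (0.09, 0.23, "small"), (0.10, 0.20, "small"),
            (0.11, 0.21, "small"), (0.12, 0.25, "small"), (0.13, 0.19, "small"),
            (0.14, 0.24, "small"), (0.15, 0.22, "small"), (0.85, 0.78, "large"),
            (0.87, 0.83, "large"), (0.88, 0.75, "large"), (0.90, 0.80, "large"),
            (0.91, 0.85, "large"), (0.92, 0.77, "large"), (0.94, 0.82, "large"),
            (0.95, 0.79, "large")]
forest = fit_forest(services)
prox = [[similarity(forest, a, b) for b in services] for a in services]
dist = [[1.0 - p for p in row] for row in prox]
medoids, labels = k_medoids(dist, k=2)
new_label = predict(forest, services, medoids, (0.13, 0.23, "small"))
print(labels, new_label)
```

Because the forest is trained once off-line, placing a newly arrived service only requires routing it through the existing trees, which is what makes an on-line prediction step cheap in a sketch of this kind.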
