Computing Resource Prediction for MapReduce Applications Using Decision Tree

The cloud computing paradigm offers users access to computing resources in a pay-as-you-go manner. However, for both cloud computing vendors and users, it is a challenge to predict how many resources are needed to run an application in a cloud at a required level of quality. This research focuses on developing a model to predict the computing resource consumption of MapReduce applications in a cloud computing environment. Based on the Classification and Regression Tree (CART), the proposed approach derives knowledge of the relationship among application features, quality of service, and the amount of computing resources from a small training set. The experiments show that the prediction accuracy is as high as 80%. This research can potentially benefit both cloud vendors and users by improving resource management and reducing costs.
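To make the CART idea concrete, the following is a minimal sketch of a regression tree that maps application features and a quality-of-service target to a resource amount. It is not the authors' implementation. The feature choice (input size in GB, required completion time in minutes), the target (number of nodes), the function names, and the six training rows are all hypothetical and invented purely for illustration.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (the regression impurity CART minimizes)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature_index, threshold) that most reduces total SSE, or None."""
    best, best_score = None, sse(targets)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            score = sse(left) + sse(right)
            if score < best_score - 1e-12:
                best, best_score = (f, thr), score
    return best

def build_tree(rows, targets, max_depth=3, min_samples=2):
    """Grow a binary regression tree greedily; leaves store the mean target."""
    split = None
    if max_depth > 0 and len(rows) >= min_samples:
        split = best_split(rows, targets)
    if split is None:
        return {"leaf": mean(targets)}
    f, thr = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": build_tree([r for r, _ in left], [t for _, t in left],
                           max_depth - 1, min_samples),
        "right": build_tree([r for r, _ in right], [t for _, t in right],
                            max_depth - 1, min_samples),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the stored mean."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Hypothetical toy training set: (input size in GB, required completion time in minutes).
rows = [(10, 60), (10, 30), (50, 60), (50, 30), (100, 60), (100, 30)]
# Hypothetical target: number of nodes that met the required completion time.
targets = [2, 4, 8, 16, 16, 32]

tree = build_tree(rows, targets)
print(predict(tree, (120, 20)))  # predicted node count for an unseen job
```

A tree like this stays readable: each internal node is a threshold on one feature, so the path to a leaf explains why a given resource amount was predicted. In practice a library implementation with pruning and cross-validation would replace this hand-rolled version.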
