Modelling and Prediction of Resource Utilization of Hadoop Clusters: A Machine Learning Approach

Hadoop is a distributed computing framework with a large number of configurable parameters. These parameters affect both system resource usage and execution time, and optimizing the performance of a Hadoop cluster by tuning so many of them is a tedious task. Most current big data modeling approaches do not account for the complex interactions between configuration parameters and changes in the cluster environment, such as different datasets or queries. This makes it difficult to predict the performance or resource utilization of a cluster on real-world datasets, whose size and content vary. This paper presents a model of the resource utilization of a Hadoop cluster based on Hadoop configuration parameters and dataset structure. Our approach builds a machine-learning-based model from Hive-based Hadoop queries and then predicts the outcome for a particular parameter setting and query type. We used decision trees to build a model for each of our performance metrics. Decision rules were extracted from these tree-based models and evaluated for their ability to generalize to unseen data. Our experiments indicated that the percentage of columns selected, the number of mappers, and the number of replicas have a statistically significant impact on the utilization of different resources in a Hadoop cluster.
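
The tree-building and rule-extraction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small CART-style regression tree (variance-reduction splits) written from scratch, the feature names are hypothetical labels for the three factors named in the abstract, and the training rows and CPU-utilization targets are synthetic values invented purely for illustration.

```python
from statistics import mean

# Hypothetical feature names; every number below is synthetic, not a measurement.
FEATURES = ["pct_columns_selected", "num_mappers", "replication_factor"]

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature_index, threshold) that most reduces SSE, or None."""
    best, best_score = None, sse(targets)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            score = sse(left) + sse(right)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def build_tree(rows, targets, depth=0, max_depth=2):
    """Grow a regression tree; each leaf stores the mean target of its rows."""
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None:
        return {"leaf": mean(targets)}
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the splits from the root down to a leaf and return its value."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

def extract_rules(tree, conditions=()):
    """Flatten the tree into (list_of_conditions, predicted_value) rules."""
    if "leaf" in tree:
        return [(list(conditions), tree["leaf"])]
    name, t = FEATURES[tree["feature"]], tree["threshold"]
    return (extract_rules(tree["left"], conditions + (f"{name} <= {t}",))
            + extract_rules(tree["right"], conditions + (f"{name} > {t}",)))

# Synthetic training set: (pct columns selected, mappers, replicas) -> CPU %.
rows = [(25, 2, 1), (25, 4, 3), (50, 2, 1), (50, 4, 3),
        (75, 2, 3), (75, 4, 1), (100, 2, 1), (100, 4, 3)]
targets = [20, 24, 30, 34, 60, 62, 70, 74]

tree = build_tree(rows, targets)
rules = extract_rules(tree)
for conditions, value in rules:
    print("IF", " AND ".join(conditions), "THEN cpu_util ~", value)
```

Each root-to-leaf path becomes one IF/THEN rule, which is what makes tree models convenient for explaining which parameters drive a metric. In this sketch one tree is fitted for one metric; modelling several metrics, as the abstract describes, would mean fitting one such tree per metric and checking the extracted rules against held-out runs.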
