Statistical Learning-Based Prediction of Execution Time of Data-Intensive Program Under Hadoop2.0

This paper predicts the running time of data-intensive MapReduce programs in a Hadoop2.0 environment. Although MapReduce programs are diverse, they can be divided into data-intensive and computation-intensive programs according to their time complexity and nature. Predicting the running time of computation-intensive programs has always been difficult, whereas Hadoop exhibits certain database-like attributes and its workloads are basically data-intensive. Moreover, the execution time of a data-intensive program is closely related to the amount of data it processes and shows certain statistical characteristics, so statistical learning is applied to predict the execution time. The paper first generates training and test data as required and then selects appropriate features by analyzing the logs. Prediction was first performed with the KCCA algorithm, but deficiencies were found. Then, based on the characteristics of the kernel function, a prediction method based on deep learning was proposed, and its results were significant.
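As a rough illustration of the statistical relationship between data volume and execution time described above, the sketch below fits an ordinary least-squares line to pairs of input size and runtime and uses it to predict a new job. It is neither the KCCA method nor the deep learning method of the paper, only a minimal baseline under stated assumptions: the function names `fit_line` and `predict` are hypothetical, and the training pairs are invented for illustration rather than taken from any Hadoop log.

```python
from statistics import mean


def fit_line(sizes_gb, times_s):
    """Ordinary least-squares fit of time = slope * size + intercept."""
    mx, my = mean(sizes_gb), mean(times_s)
    sxx = sum((x - mx) ** 2 for x in sizes_gb)
    if sxx == 0:
        raise ValueError("need at least two distinct input sizes")
    sxy = sum((x - mx) * (y - my) for x, y in zip(sizes_gb, times_s))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, size_gb):
    """Predicted runtime in seconds for a job with the given input size."""
    slope, intercept = model
    return slope * size_gb + intercept


# Hypothetical (input size in GB, runtime in seconds) pairs, invented
# for illustration only; real features would come from job logs.
train_sizes = [1.0, 2.0, 4.0, 8.0]
train_times = [40.0, 70.0, 130.0, 250.0]

model = fit_line(train_sizes, train_times)
estimate = predict(model, 16.0)
print(f"estimated runtime for 16 GB: {estimate:.1f} s")
```

A single-feature linear model like this only captures the dependence on input size; the features selected from logs in the paper would enter as additional inputs to a richer model.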
