Investigating the use of random forest in software effort estimation

Abstract Over the last two decades, the number of studies applying machine learning (ML) techniques to software development effort estimation (SDEE) has increased markedly; these studies aim to improve the accuracy of the estimates and to understand the process used to generate them. Among these ML techniques, decision tree-based models have received considerable scholarly attention thanks to their generalization ability and understandability. However, very few studies have investigated the use of random forest (RF) in software effort estimation. In this paper, an RF model is designed and optimized empirically by varying the values of its key parameters. The performance of the RF model is compared with that of the classical regression tree (RT). The evaluation was performed using the 30% hold-out validation method on three datasets: ISBSG R8, Tukutuku and COCOMO. To identify the most accurate technique, we used three widely known accuracy measures: Pred(0.25), MMRE and MdMRE. The results show that the optimized random forest outperforms the regression tree model on all evaluation criteria.
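The abstract names the three accuracy measures without defining them. The sketch below assumes their conventional definitions in the effort estimation literature: the magnitude of relative error of a project is MRE = |actual - estimated| / actual, MMRE and MdMRE are the mean and median of the MREs, and Pred(0.25) is the fraction of projects whose MRE is at most 0.25. The function names and the effort values are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
from statistics import mean, median

def mre(actual, estimated):
    """Magnitude of relative error for one project (assumed definition)."""
    return abs(actual - estimated) / actual

def mmre(actuals, estimates):
    """Mean MRE over all projects; lower is better."""
    return mean(mre(a, e) for a, e in zip(actuals, estimates))

def mdmre(actuals, estimates):
    """Median MRE over all projects; less sensitive to outliers than MMRE."""
    return median(mre(a, e) for a, e in zip(actuals, estimates))

def pred(actuals, estimates, level=0.25):
    """Fraction of projects whose MRE is at most `level`; higher is better."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return sum(err <= level for err in errors) / len(errors)

# Made-up efforts (e.g. person-hours) for four held-out projects.
actual_effort = [100, 200, 400, 800]
estimated_effort = [110, 150, 420, 1200]

print("MMRE      :", round(mmre(actual_effort, estimated_effort), 3))
print("MdMRE     :", round(mdmre(actual_effort, estimated_effort), 3))
print("Pred(0.25):", pred(actual_effort, estimated_effort))
```

Under these definitions, a model is preferred when it has lower MMRE and MdMRE and higher Pred(0.25), which is the sense in which one technique can be compared with another on a held-out test set.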
