Active learning and effort estimation: Finding the essential content of software effort estimation data

Background: Do we always need complex methods for software effort estimation (SEE)? Aim: To characterize the essential content of SEE data, i.e., the smallest number of features and instances required to capture the information within SEE data. If the essential content is very small, then 1) the contained information must be very brief and 2) the added value of complex learning schemes must be minimal. Method: Our QUICK method computes the Euclidean distance between the rows (instances) and columns (features) of SEE data, then prunes synonyms (similar features) and outliers (distant instances), then assesses the reduced data by comparing predictions from 1) a simple learner using the reduced data and 2) a state-of-the-art learner (CART) using all data. Performance is measured using hold-out experiments and expressed in terms of mean and median MRE, MAR, PRED(25), MBRE, MIBRE, or MMER. Results: For 18 datasets, QUICK pruned 69 to 96 percent of the training data (median = 89 percent). k = 1 nearest neighbor predictions (in the reduced data) performed as well as CART's predictions (using all data). Conclusion: The essential content of some SEE datasets is very small. Complex estimation methods may be overelaborate for such datasets and can be simplified. We offer QUICK as an example of such a simpler SEE method.
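The pipeline described in the Method section can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' exact QUICK implementation: the distance threshold for synonym pruning, the popularity-based outlier rule, and the `keep_fraction` parameter are all assumptions made here for demonstration.

```python
import numpy as np

def prune_synonyms(X, threshold=0.1):
    """Drop features whose normalized columns lie within `threshold`
    Euclidean distance of an already-kept column (near-duplicates)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    Xn = (X - lo) / np.where(hi - lo == 0, 1, hi - lo)  # scale to [0, 1]
    kept = []
    for j in range(Xn.shape[1]):
        if all(np.linalg.norm(Xn[:, j] - Xn[:, k]) > threshold for k in kept):
            kept.append(j)
    return kept  # indices of surviving features

def prune_outliers(X, keep_fraction=0.2):
    """Keep the most 'popular' rows: those most often the nearest
    neighbor of some other row. Distant, unpopular rows are outliers."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a row is not its own neighbor
    popularity = np.bincount(d.argmin(axis=1), minlength=n)
    k = max(1, int(np.ceil(keep_fraction * n)))
    return np.argsort(-popularity)[:k]       # indices of surviving instances

def predict_1nn(X_train, y_train, x):
    """k = 1 nearest neighbor: estimate effort as the effort of the
    single closest training instance."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[d.argmin()]
```

Applied in sequence (prune columns, then rows, then predict with 1-NN on what remains), this mirrors the comparison in the paper: a trivially simple learner on heavily reduced data versus a full learner such as CART on all of it.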
