Clustering Dycom: An Online Cross-Company Software Effort Estimation Study

Background: Software Effort Estimation (SEE) can be formulated as an online learning problem, in which new projects are completed over time and may become available for training. In this scenario, a Cross-Company (CC) SEE approach called Dycom can drastically reduce the number of Within-Company (WC) projects needed for training, saving the high cost of collecting such training projects. However, Dycom relies on splitting the CC projects into different subsets in order to create its CC models, and this splitting can have a significant impact on Dycom's predictive performance. Aims: This paper investigates whether clustering methods can help to find good CC splits for Dycom. Method: Dycom is extended to use clustering methods for creating the CC subsets. Three clustering methods are investigated: Hierarchical Clustering, K-Means, and Expectation-Maximisation (EM). Clustering Dycom is compared against the original Dycom with CC subsets of different sizes, based on four SEE databases. A baseline WC model is also included in the analysis. Results: Clustering Dycom with K-Means can potentially help to split the CC projects, achieving similar or better predictive performance than the original Dycom. However, K-Means still requires the number of CC subsets to be pre-defined, and a poor choice can negatively affect predictive performance. EM enables Dycom to set the number of CC subsets automatically while still maintaining or improving predictive performance with respect to the baseline WC model. Clustering Dycom with Hierarchical Clustering did not offer a significant advantage in terms of predictive performance. Conclusion: Clustering methods can be an effective way to automatically generate Dycom's CC subsets.
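As a rough illustration of the splitting step, the sketch below partitions CC projects (represented as feature vectors) into k subsets with a minimal K-Means pass. This is purely illustrative and assumes hypothetical project features (size, team size) and a hand-picked k; it is not Dycom's actual implementation, which is described in the paper.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Toy K-Means: partition CC projects (feature vectors) into k subsets."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iters):
        # assign each project to its nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # recompute each centroid as the mean of its assigned projects
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append(tuple(sum(col) / len(c) for col in zip(*c)))
            else:
                new_centroids.append(centroids[i])  # keep old centroid if cluster empty
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return clusters

# Hypothetical CC projects described by (size in KLOC, team size)
cc_projects = [(1.0, 2), (1.2, 3), (10.0, 20), (11.0, 22), (1.1, 2), (10.5, 21)]
subsets = kmeans(cc_projects, k=2)
```

Each resulting subset would then be used to train one of Dycom's CC models. Note that this variant still requires k to be chosen up front, which is exactly the limitation the paper reports for K-Means and which EM avoids by selecting the number of subsets automatically.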
