Case consistency: a necessary data quality property for software engineering data sets

Data quality is an essential aspect of any empirical study, because the validity of models and analysis results derived from empirical data is inherently influenced by that data's quality. In this empirical study, we focus on data consistency as a critical factor influencing the accuracy of prediction models in software engineering. We propose a software metric called Cases Inconsistency Level (CIL) for analyzing conflicts within software engineering data sets; it leverages probability statistics on project cases and counts the number of conflicting case pairs. The results demonstrate that CIL can serve as a metric for distinguishing consistent data sets from inconsistent ones, which is valuable for building robust prediction models. Beyond measuring the level of consistency, CIL is shown to predict whether an effort model built from a data set can achieve high accuracy, an important indicator for empirical experiments in software engineering.
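The abstract does not reproduce the CIL formula, but the core idea of counting conflicting case pairs can be sketched as follows. This is an illustrative assumption, not the authors' definition: here two cases are treated as "conflicting" when their feature vectors are nearly identical yet their target values (e.g. effort) diverge sharply, and the inconsistency level is the fraction of such pairs.

```python
from itertools import combinations

def cases_inconsistency_level(cases, feature_tol=0.1, target_tol=0.5):
    """Hypothetical CIL-style sketch: fraction of case pairs whose
    features are near-identical but whose target values diverge.
    `cases` is a list of (feature_tuple, target_value) pairs."""
    conflicts = 0
    total_pairs = 0
    for (feats_a, tgt_a), (feats_b, tgt_b) in combinations(cases, 2):
        total_pairs += 1
        # Features similar: every attribute differs by at most feature_tol.
        similar = all(abs(a - b) <= feature_tol
                      for a, b in zip(feats_a, feats_b))
        # Targets conflicting: relative difference exceeds target_tol.
        diverge = abs(tgt_a - tgt_b) > target_tol * max(tgt_a, tgt_b, 1e-9)
        if similar and diverge:
            conflicts += 1
    return conflicts / total_pairs if total_pairs else 0.0

# Two near-duplicate cases with very different effort form one
# conflicting pair out of three pairs total.
data = [((1.0, 2.0), 100.0), ((1.0, 2.05), 300.0), ((5.0, 9.0), 120.0)]
level = cases_inconsistency_level(data)
```

Under this sketch, a higher returned value signals a less consistent data set; the actual CIL uses probability statistics over cases rather than the fixed tolerances assumed here.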
