Data quality: Cinderella at the software metrics ball?

In this keynote I explore what exactly we mean by data quality, techniques for assessing it, and the very significant challenges that poor data quality can pose. I believe we neglect data quality at our peril since, whether we like it or not, our research results are founded upon data and upon our assumption that data quality issues do not confound those results. A systematic review of the literature suggests that even explicitly discussing data quality is a minority practice. I therefore suggest that this topic should become a higher priority amongst empirical software engineering researchers.
