Data Sets and Data Quality in Software Engineering: Eight Years On

OBJECTIVE - to assess the extent and types of techniques used to manage quality within software engineering data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets. METHOD - we perform a systematic review of available empirical software engineering studies. RESULTS - only 23 of the many hundreds of studies assessed explicitly considered data quality. CONCLUSIONS - first, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need more research into means of identifying, and ideally repairing, noisy cases. Third, it should become routine to use sensitivity analysis to assess conclusion stability with respect to the assumptions that must be made concerning noise levels.
