Assessing the Quality and Cleaning of a Software Project Data Set: An Experience Report

OBJECTIVE - The aim is to assess the impact that noise has on predictive accuracy by comparing noise handling techniques. METHOD - We describe the process of cleaning a large software management data set that initially comprised more than 10,000 projects. Data quality is assessed mainly through feedback from the data provider and manual inspection of the data. Three methods of noise correction (polishing, noise elimination and robust algorithms) are compared in terms of their accuracy. Noise detection was undertaken using a regression tree model. RESULTS - The three noise correction methods were compared and differences in their accuracy were noted. CONCLUSIONS - The results demonstrated that polishing improves classification accuracy compared to the noise elimination and robust algorithm approaches.
