Identifying learners robust to low quality data

Real-world datasets commonly contain noise in both the independent and dependent variables. Such noise, which typically consists of erroneous variable values, has been shown to significantly degrade the classification performance of learners. In this study, we identify learners whose performance remains robust in the presence of low-quality (noisy) measurement data. Noise was injected into five class-imbalanced software engineering measurement datasets, each initially relatively free of noise. The experimental factors considered were the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. To our knowledge, no other related studies have identified learners that are robust in the presence of low-quality measurement data. Based on the results of this study, we recommend the random forest learner for building classification models from noisy data.
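
The datasets and exact noise-injection procedure used in the study are not reproduced here; the sketch below is a hypothetical illustration, assuming binary class labels, synthetic data, and scikit-learn, of how label noise might be injected at a controlled overall level and minority-class percentage before measuring the robustness of a random forest.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def inject_class_noise(y, noise_level, minority_fraction, rng):
        # Flip noise_level of all training labels; minority_fraction of the
        # flips come from the minority class, the rest from the majority class.
        y_noisy = y.copy()
        n_flips = int(noise_level * len(y))
        minority_label = np.argmin(np.bincount(y))
        minority_idx = np.where(y == minority_label)[0]
        majority_idx = np.where(y != minority_label)[0]
        n_min = min(int(minority_fraction * n_flips), len(minority_idx))
        n_maj = n_flips - n_min
        flip_idx = np.concatenate([
            rng.choice(minority_idx, size=n_min, replace=False),
            rng.choice(majority_idx, size=n_maj, replace=False),
        ])
        y_noisy[flip_idx] = 1 - y_noisy[flip_idx]  # binary labels assumed
        return y_noisy

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    for noise_level in (0.0, 0.1, 0.2, 0.3):
        y_tr_noisy = inject_class_noise(y_tr, noise_level, minority_fraction=0.25, rng=rng)
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X_tr, y_tr_noisy)                                  # train on corrupted labels
        auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])   # evaluate on clean labels
        print(f"training-label noise = {noise_level:.0%}, test AUC = {auc:.3f}")

In a setup like this, a learner's robustness appears as how slowly its clean-test AUC degrades as the injected noise level grows; the same loop can be repeated with other learners (e.g., decision trees, naive Bayes, nearest neighbor) to compare their sensitivity to noise.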
