Classification algorithm sensitivity to training data with non-representative attribute noise

We present an empirical comparison of classification algorithms when training data contains attribute noise at levels not representative of field data. To study algorithm sensitivity, we develop an experimental design with noise situation, algorithm, noise level, and training set size as factors. Our results contradict the conventional wisdom, indicating that investments to achieve representative training noise levels may not be worthwhile. In general, over-representative training noise should be avoided, while under-representative training noise is less of a concern. However, interactions among algorithm, noise level, and training set size indicate that these general findings may not apply in particular practice situations.
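To make the core manipulation concrete, the sketch below corrupts training attributes with noise at one level while the test ("field") data carries a different level, then compares each classifier's field accuracy across the mismatch. This is a minimal illustration, not the paper's experimental apparatus: scikit-learn, the stand-in dataset, the Gaussian noise model, and the specific noise levels are all illustrative assumptions.

```python
# Sketch (illustrative only) of a train/field attribute-noise mismatch experiment:
# train at one noise level, evaluate at another, and compare classifiers.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def add_attribute_noise(X, level, rng):
    """Corrupt attributes with zero-mean Gaussian noise scaled by each
    attribute's standard deviation; `level` sets the noise magnitude."""
    return X + rng.normal(0.0, level * X.std(axis=0), size=X.shape)

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

field_noise = 0.2  # assumed noise level in deployed ("field") data
classifiers = [("tree", DecisionTreeClassifier(random_state=0)),
               ("logistic", LogisticRegression(max_iter=5000))]
for name, clf in classifiers:
    # 0.0/0.1 are under-representative, 0.2 representative, 0.4 over-representative
    for train_noise in [0.0, 0.1, 0.2, 0.4]:
        clf.fit(add_attribute_noise(X_train, train_noise, rng), y_train)
        acc = clf.score(add_attribute_noise(X_test, field_noise, rng), y_test)
        print(f"{name}: train noise {train_noise:.1f} -> field accuracy {acc:.3f}")
```

Varying the training set size alongside the noise levels, and repeating the runs over multiple random seeds, would fill in the remaining factors of the design described above.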
