Effect of label noise in the complexity of classification problems

Abstract

Noisy data are common in real-world problems and may have several causes, such as inaccuracies, distortions, or contamination during data collection, storage, or transmission. Noise can increase the complexity of a classification problem: it makes objects from different classes harder to discriminate and requires more complex decision boundaries to separate the data. In this paper, we investigate how noise affects the complexity of classification problems by monitoring the sensitivity of several data complexity indices at different levels of label noise. To characterize the complexity of a classification dataset, we use geometric, statistical, and structural measures extracted from the data. The experimental results show that some measures are more sensitive than others to the addition of noise to a dataset. These measures can be used to develop new preprocessing techniques for noise identification and new label-noise-tolerant algorithms. We also present preliminary results for a new noise identification filter based on two of the complexity measures that were more sensitive to the presence of label noise.
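The experimental idea described in the abstract (inject label noise at increasing rates and watch how a complexity index responds) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's protocol: the synthetic two-class Gaussian data, the uniform random label flipping, the noise levels, and the choice of measure. The measure used is the leave-one-out 1-nearest-neighbor error rate, a geometric index commonly called N3 in the data-complexity literature; the abstract does not say which measures the authors used or found most sensitive. The helper names `make_blobs`, `flip_labels`, and `loo_1nn_error` are hypothetical.

```python
import random


def make_blobs(n_per_class, seed=0):
    """Two well-separated 2-D Gaussian classes (synthetic, illustrative data)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (4.0, 4.0)]):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            labels.append(label)
    return points, labels


def flip_labels(labels, noise_rate, seed=0):
    """Flip the binary label of a uniformly random fraction of the examples."""
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def loo_1nn_error(points, labels):
    """Leave-one-out 1-NN error rate: higher values suggest a harder problem."""
    errors = 0
    for i, (xi, yi) in enumerate(points):
        best_j, best_d = -1, float("inf")
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue  # leave the query point out of its own neighborhood
            d = (xi - xj) ** 2 + (yi - yj) ** 2
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] != labels[i]:
            errors += 1
    return errors / len(points)


if __name__ == "__main__":
    points, clean = make_blobs(100)
    # Assumed noise levels; the abstract only says "different label noise levels".
    for rate in (0.0, 0.1, 0.2, 0.3):
        noisy = flip_labels(clean, rate)
        print(f"noise={rate:.1f}  loo_1nn_error={loo_1nn_error(points, noisy):.3f}")
```

On data like this, the index would be expected to rise as the noise rate grows, which is the kind of sensitivity the abstract refers to. A measure that barely changes across noise levels would be a poor basis for a noise filter.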
