Data Cleaning and Classification in the Presence of Label Noise with Class-Specific Autoencoder

We present a simple but effective method for data cleaning and classification in the presence of label noise. The key idea is to treat data points with noisy labels as outliers of the class indicated by the noisy label. Because identifying such dubious observations directly is difficult in general, we instead reduce their influence through feature learning with class-specific autoencoders. Specifically, for each class we learn a feature space from all samples labeled as that class, including those whose labels are noisy. For the case of high label noise, we further propose a weighted class-specific autoencoder that accounts for the effect of each data point. To fully exploit the learned feature spaces, we classify test samples by minimum reconstruction error. Experiments on several datasets show that the proposed method achieves state-of-the-art performance on the related tasks with noisy labels.
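The pipeline described above (one feature space per class, then classification by minimum reconstruction error) can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps the autoencoder for PCA, which is the optimal linear autoencoder under squared error, and it omits the weighted variant. The function names, the synthetic data, and the 10% label-flip rate are all assumptions made for illustration.

```python
import numpy as np

def fit_class_autoencoders(X, y, n_components=1):
    """Fit one linear autoencoder (PCA stand-in, an assumption) per class label."""
    models = {}
    for label in np.unique(y):
        Xc = X[y == label]  # all samples labeled as this class, noisy ones included
        mean = Xc.mean(axis=0)
        # Rows of Vt are the principal directions of the centered class data.
        _, _, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
        models[label] = (mean, Vt[:n_components])
    return models

def reconstruction_errors(models, X):
    """Squared reconstruction error of every sample under every class model."""
    labels = sorted(models)
    errs = np.empty((X.shape[0], len(labels)))
    for j, label in enumerate(labels):
        mean, W = models[label]
        Z = (X - mean) @ W.T   # encode into the class feature space
        X_hat = Z @ W + mean   # decode back to input space
        errs[:, j] = np.sum((X - X_hat) ** 2, axis=1)
    return np.array(labels), errs

def predict(models, X):
    """Assign each sample to the class whose model reconstructs it best."""
    labels, errs = reconstruction_errors(models, X)
    return labels[np.argmin(errs, axis=1)]

# Hypothetical synthetic data: two classes lying near different lines in 3-D.
rng = np.random.default_rng(0)
n = 200
y_true = np.repeat([0, 1], n)
directions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
offsets = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
t = rng.normal(scale=3.0, size=(2 * n, 1))
X = t * directions[y_true] + offsets[y_true] + rng.normal(scale=0.1, size=(2 * n, 3))

# Flip roughly 10% of the labels to simulate label noise.
flip = rng.random(2 * n) < 0.1
y_noisy = np.where(flip, 1 - y_true, y_true)

models = fit_class_autoencoders(X, y_noisy, n_components=1)
pred = predict(models, X)
suspected = pred != y_noisy  # candidates for data cleaning
print("accuracy vs. true labels:", np.mean(pred == y_true))
print("samples flagged as mislabeled:", int(suspected.sum()))
```

In this toy setting the mislabeled points act as outliers of the class they were assigned to, so they shift each class model only slightly, and samples whose predicted class disagrees with the given label are natural candidates for cleaning.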
