STGN: An Implicit Regularization Method for Learning with Noisy Labels in Natural Language Processing

Noisy labels are ubiquitous in natural language processing (NLP) tasks. Existing work on learning with noisy labels in NLP is often limited to dedicated tasks or specific training procedures, which makes it hard to apply broadly. To address this issue, SGD noise has been explored as a more general way to alleviate the effect of noisy labels by injecting benign noise into the stochastic gradient descent process. However, previous studies apply an identical perturbation to all samples, which may cause the model to overfit incorrect samples or to optimize correct ones inadequately. Motivated by this, we propose stochastic tailor-made gradient noise (STGN), which mitigates the effect of inherent label noise by introducing tailor-made benign noise for each sample. Specifically, we investigate multiple principles to precisely and stably discriminate correct samples from incorrect ones and accordingly apply different intensities of perturbation to them. A detailed theoretical analysis shows that STGN has desirable properties that benefit model generalization. Experiments on three different NLP tasks demonstrate the effectiveness and versatility of STGN. Moreover, STGN can boost existing robust training methods.
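
To make the idea of sample-dependent benign noise concrete, the sketch below illustrates one way such a scheme could be wired into a training loss. It is not the paper's exact STGN formulation: the loss-threshold rule for judging which samples are likely correct, the two noise scales (`sigma_correct`, `sigma_incorrect`), and the multiplicative re-weighting of per-sample losses are all illustrative assumptions standing in for the principles and perturbation intensities described in the abstract.

```python
import torch
import torch.nn.functional as F

def tailor_made_noise_loss(logits, labels,
                           sigma_correct=0.1, sigma_incorrect=0.5,
                           loss_threshold=1.0):
    """Illustrative per-sample gradient-noise loss (not the paper's exact STGN).

    Each sample's cross-entropy loss is re-weighted by a stochastic factor
    (1 + eps_i) with eps_i ~ N(0, sigma_i^2), so the aggregated gradient
    receives sample-dependent noise. The split into "likely correct" and
    "likely incorrect" samples via a loss threshold, and the two noise
    scales, are placeholder assumptions for illustration.
    """
    per_sample_loss = F.cross_entropy(logits, labels, reduction="none")

    with torch.no_grad():
        # Heuristic discrimination: low-loss samples treated as likely correct.
        likely_correct = per_sample_loss < loss_threshold
        sigma = torch.where(
            likely_correct,
            torch.full_like(per_sample_loss, sigma_correct),
            torch.full_like(per_sample_loss, sigma_incorrect),
        )
        # Per-sample stochastic weights; noise intensity differs by group.
        weights = 1.0 + sigma * torch.randn_like(per_sample_loss)

    # Weighted mean: each sample's gradient is scaled by its noisy weight.
    return (weights * per_sample_loss).mean()
```

In a standard training loop this would simply replace the usual mean cross-entropy call (e.g., `loss = tailor_made_noise_loss(model(batch_x), batch_y); loss.backward()`), leaving the optimizer and schedule untouched, which is what allows this family of methods to be combined with existing robust training techniques.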
