Detecting Label Errors by Using Pre-Trained Language Models

We show that large pre-trained language models are inherently highly capable of identifying label errors in natural language datasets: simply examining out-of-sample data points in descending order of fine-tuned task loss significantly outperforms more complex error-detection mechanisms proposed in previous work. To this end, we contribute a novel method for introducing realistic, human-originated label noise into existing crowdsourced datasets such as SNLI and TweetNLP. We show that this noise has properties similar to those of real, hand-verified label errors and is harder to detect than existing synthetic noise, posing challenges for model robustness. We argue that human-originated noise is a better standard for evaluation than synthetic noise. Finally, we use crowdsourced verification to evaluate the detection of real errors on IMDB, Amazon Reviews, and Recon, and confirm that pre-trained models achieve 9–36% higher absolute area under the precision-recall curve (AUPRC) than existing methods.
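The detection recipe itself is simple enough to sketch. The snippet below is a minimal illustration of the loss-ranking idea, not the authors' released pipeline: it fine-tunes a pre-trained classifier on each cross-validation fold, scores the held-out examples by task loss, and ranks them in descending order as label-error candidates. The model name, field names, fold count, and training hyperparameters are all assumptions made for the example.

```python
# Minimal sketch: rank examples by out-of-sample fine-tuned task loss.
# Hypothetical setup using Hugging Face Transformers; hyperparameters are illustrative.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.model_selection import KFold
from sklearn.metrics import average_precision_score
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)


def out_of_sample_losses(texts, labels, model_name="roberta-base",
                         num_labels=2, n_splits=5):
    """Cross-validated per-example loss: each example is scored by a model
    that never saw it during fine-tuning ("out-of-sample")."""
    tok = AutoTokenizer.from_pretrained(model_name)

    def encode(batch):
        return tok(batch["text"], truncation=True, max_length=128,
                   padding="max_length")

    losses = np.zeros(len(texts))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(texts):
        train_ds = Dataset.from_dict(
            {"text": [texts[i] for i in train_idx],
             "label": [labels[i] for i in train_idx]}).map(encode, batched=True)
        test_ds = Dataset.from_dict(
            {"text": [texts[i] for i in test_idx],
             "label": [labels[i] for i in test_idx]}).map(encode, batched=True)

        model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels)
        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="tmp_fold", num_train_epochs=2,
                                   per_device_train_batch_size=16,
                                   report_to=[]),
            train_dataset=train_ds)
        trainer.train()

        # Per-example cross-entropy on the held-out fold.
        logits = trainer.predict(test_ds).predictions
        fold_labels = torch.tensor([labels[i] for i in test_idx])
        ce = F.cross_entropy(torch.tensor(logits), fold_labels,
                             reduction="none").numpy()
        losses[test_idx] = ce
    return losses


# Usage sketch: highest out-of-sample loss first = most likely label error.
# losses = out_of_sample_losses(texts, labels)
# candidates = np.argsort(-losses)
# If ground-truth error flags are available, the paper's AUPRC metric is simply:
# auprc = average_precision_score(is_error, losses)
```

The ranking score doubles as the detector's confidence, so evaluating it with area under the precision-recall curve (as in the comment above) requires no thresholding.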
