Detecting Label Errors by Using Pre-Trained Language Models