Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning

Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human annotators, who often disagree. Prior work has shown that this disagreement can be helpful in training models. We propose a novel method to incorporate this disagreement as information: in addition to the standard error computation, we use soft labels (i.e., probability distributions over the annotator labels) as the targets of an auxiliary task in a multi-task neural network. We measure the divergence between the predictions and the target soft labels with several loss functions and evaluate the models on various NLP tasks. We find that the soft-label prediction auxiliary task reduces the penalty for errors on ambiguous entities and thereby mitigates overfitting. It significantly improves performance across tasks, beyond the standard approach and prior work.
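The multi-task setup described above lends itself to a compact sketch. Below is a minimal PyTorch illustration, assuming a generic encoder over input features; all names (SoftLabelMultiTask, aux_weight, the layer sizes) are illustrative rather than taken from the paper, and KL divergence stands in for the several divergence losses the paper compares.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftLabelMultiTask(nn.Module):
    """Shared encoder with two output heads: the main head predicts the
    hard (gold / majority-vote) label, while the auxiliary head predicts
    the distribution of annotator labels."""

    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.hard_head = nn.Linear(hidden_dim, num_classes)  # main task
        self.soft_head = nn.Linear(hidden_dim, num_classes)  # auxiliary task

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        return self.hard_head(h), self.soft_head(h)


def multitask_loss(hard_logits, soft_logits, gold, soft_targets, aux_weight=1.0):
    # Main objective: standard cross-entropy against the hard label.
    main = F.cross_entropy(hard_logits, gold)
    # Auxiliary objective: KL divergence between the predicted distribution
    # and the empirical distribution of annotator labels.
    aux = F.kl_div(F.log_softmax(soft_logits, dim=-1), soft_targets,
                   reduction="batchmean")
    return main + aux_weight * aux
```

The soft targets themselves are simply normalized annotator vote counts per item: if three of five annotators chose class 0 and two chose class 1, the target distribution is [0.6, 0.4]. Swapping F.kl_div for another divergence measure (e.g., Jensen-Shannon) would reproduce the loss-function comparison mentioned in the abstract.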
