Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Majority voting and averaging are common approaches for resolving annotator disagreements and deriving single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored when annotations are aggregated into a single ground truth. To address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task approach treats predicting each annotator's judgements as a separate subtask, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
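The multi-task idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single tanh layer stands in for whatever shared encoder is actually used, and the function names (`init_params`, `forward`, `masked_bce`, `aggregate`), the masking scheme for items an annotator did not label, the majority vote over per-annotator predictions, and the use of vote variance as an uncertainty score are all assumptions made for illustration.

```python
import numpy as np

def init_params(n_features, n_hidden, n_annotators, seed=0):
    """Hypothetical parameters: one shared layer plus one binary head per annotator."""
    rng = np.random.default_rng(seed)
    return {
        "W_shared": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b_shared": np.zeros(n_hidden),
        "W_heads": rng.normal(0.0, 0.1, (n_hidden, n_annotators)),
        "b_heads": np.zeros(n_annotators),
    }

def forward(params, X):
    """Return per-annotator probabilities, shape (n_items, n_annotators)."""
    # Shared representation (a stand-in for a learned text encoder).
    H = np.tanh(X @ params["W_shared"] + params["b_shared"])
    # Each column of W_heads is a separate annotator-specific subtask head.
    logits = H @ params["W_heads"] + params["b_heads"]
    return 1.0 / (1.0 + np.exp(-logits))

def masked_bce(probs, labels, mask):
    """Binary cross-entropy averaged only over (item, annotator) pairs with a label."""
    eps = 1e-9
    loss = -(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))
    # Assumption: mask is 1 where the annotator labeled the item, 0 otherwise.
    loss = np.where(mask.astype(bool), loss, 0.0)
    return loss.sum() / max(mask.sum(), 1)

def aggregate(probs):
    """Majority vote over predicted annotator labels, plus vote variance as uncertainty."""
    votes = (probs >= 0.5).astype(int)
    majority = (votes.mean(axis=1) >= 0.5).astype(int)  # ties resolve to 1
    uncertainty = votes.var(axis=1)  # 0 when all heads agree
    return majority, uncertainty
```

In this sketch, an item on which every annotator head predicts the same label receives an uncertainty of zero, while an item that splits the heads receives a higher score, which is the kind of signal one could threshold to abstain from predicting.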
