SemEval-2021 Task 12: Learning with Disagreements

Disagreement between coders is ubiquitous in virtually all datasets annotated with human judgements, in both natural language processing and computer vision. Yet most supervised machine learning methods assume that a single preferred interpretation exists for each item, which is at best an idealization. The aim of the SemEval-2021 shared task on Learning with Disagreements (Le-Wi-Di) was to provide a unified testing framework for methods that learn from data containing multiple, possibly contradictory annotations, covering the best-known disagreement-aware datasets for interpreting language and classifying images. In this paper we describe the shared task and its results.
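As a rough illustration of the setting, the sketch below shows one common way to operationalise learning with disagreements: keep every annotator's label for an item as a soft label distribution rather than collapsing it to a single adjudicated gold label, and score a model's predicted distribution against it with cross-entropy, in the spirit of the soft-loss approaches the task builds on. This is a minimal sketch, not the task's official code or evaluation metric, and the names `soft_label` and `soft_cross_entropy` are hypothetical.

```python
# Minimal, illustrative sketch: soft labels from multiple annotations.
# Not the official Le-Wi-Di code or metric; function names are hypothetical.

import numpy as np

def soft_label(annotations, num_classes):
    """Turn one item's annotator labels into a probability distribution
    by normalised vote counts, e.g. [0, 0, 1] -> [2/3, 1/3]."""
    counts = np.bincount(annotations, minlength=num_classes)
    return counts / counts.sum()

def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target;
    minimised when the prediction reproduces the target exactly."""
    return -np.sum(target * np.log(predicted + eps))

# Three annotators split 2-1 on a binary item.
target = soft_label(np.array([0, 0, 1]), num_classes=2)     # [2/3, 1/3]

# A prediction that reproduces the human disagreement scores better
# than one that is over-confident about the majority label.
print(soft_cross_entropy(target, np.array([0.66, 0.34])))   # ~0.64
print(soft_cross_entropy(target, np.array([0.99, 0.01])))   # ~1.54
```

Under this kind of objective, an over-confident model is penalised even when it picks the majority label, which is precisely the behaviour a disagreement-aware evaluation is meant to reward differently from a single-gold-label one.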