Annotation Curricula to Implicitly Train Non-Expert Annotators

Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming at the start, mentally taxing, and induce errors in the resulting annotations, especially in citizen science or crowdsourcing scenarios where domain expertise is not required and only annotation guidelines are provided. To alleviate these issues, we propose annotation curricula, a novel approach to implicitly train annotators. Our goal is to gradually introduce annotators to the task by ordering the instances to be annotated according to a learning curriculum. To do so, we first formalize annotation curricula for sentence- and paragraph-level annotation tasks, define an ordering strategy, and identify well-performing heuristics and interactively trained models on three existing English datasets. We then conduct a user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the COVID-19 pandemic. Our results show that a simple heuristic for ordering instances can already significantly reduce total annotation time while preserving high annotation quality. Annotation curricula thus provide a novel way to improve data collection. To facilitate future research, we further share our code and data, consisting of 2,400 annotations.
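For illustration, the following Python sketch shows the core idea of a heuristic annotation curriculum: instances are served easiest-first according to an estimated difficulty score. The difficulty function below (sentence length plus average word length) is a hypothetical stand-in for the readability-style heuristics evaluated in the paper, not its exact scoring method.

    # Minimal sketch of a heuristic annotation curriculum: order instances
    # easiest-first by an estimated difficulty score. The scoring function
    # is an illustrative proxy, not the paper's exact heuristic.

    from typing import List

    def difficulty(text: str) -> float:
        """Crude difficulty proxy: longer sentences with longer words
        are assumed to be harder to annotate."""
        words = text.split()
        if not words:
            return 0.0
        avg_word_len = sum(len(w) for w in words) / len(words)
        return len(words) + avg_word_len

    def curriculum_order(instances: List[str]) -> List[str]:
        """Return instances sorted easiest-first, so annotators are
        gradually introduced to harder examples."""
        return sorted(instances, key=difficulty)

    tweets = [
        "Masks do not cause hypoxia.",
        "Claims that 5G towers spread the virus conflate correlation "
        "in rollout timing with causation.",
    ]
    print(curriculum_order(tweets))  # short, simple tweet comes first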

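The interactively trained variant mentioned in the abstract can be sketched analogously: after each annotation, a model is refit on the observed annotation times, and the remaining instance predicted to be easiest is served next. The feature extractor (TF-IDF) and regressor (Ridge) below are illustrative assumptions, as is the annotate callback, which is expected to perform one annotation and return the wall-clock time it took.

    # Hedged sketch of an interactively trained annotation curriculum:
    # refit a regressor on observed annotation times after each step and
    # pick the remaining instance predicted to be easiest next. Feature
    # and model choices are assumptions, not the paper's exact setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge

    def interactive_curriculum(instances, annotate):
        """Serve instances easiest-predicted-first; `annotate(text)` is a
        user-supplied callback returning the annotation time in seconds."""
        vec = TfidfVectorizer().fit(instances)
        X = vec.transform(instances).toarray()
        remaining = list(range(len(instances)))
        seen, times = [], []
        while remaining:
            if len(seen) < 2:  # cold start: take instances in given order
                idx = remaining[0]
            else:
                model = Ridge().fit(X[seen], times)
                preds = model.predict(X[remaining])
                idx = remaining[int(preds.argmin())]
            remaining.remove(idx)
            seen.append(idx)
            times.append(annotate(instances[idx]))

A caller would pass the full instance pool plus a callback that presents one instance in the annotation interface and measures how long the annotator takes, so the curriculum adapts to the individual annotator as evidence accumulates.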