Ground-Truth, Whose Truth? - Examining the Challenges with Annotating Toxic Text Datasets

The use of machine learning (ML)-based language models (LMs) to monitor content online is on the rise. For toxic text identification, task-specific fine-tuning of these models is performed using datasets labeled by annotators who provide ground-truth labels in an effort to distinguish between offensive and normal content. These projects have led to the development, improvement, and expansion of large datasets over time and have contributed immensely to research on natural language. Despite these achievements, existing evidence suggests that ML models built on these datasets do not always produce desirable outcomes. Therefore, using a design science research (DSR) approach, this study examines selected toxic text datasets with the goal of shedding light on some of their inherent issues and contributing to discussions on how existing and future projects can navigate these challenges. To this end, we re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help improve dataset quality. While this approach may not improve the traditional metric of inter-annotator agreement, it may better capture the context-dependence of toxicity and the diversity of annotator perspectives. We discuss the implications of these results for both theory and practice.
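To make the contrast concrete, the sketch below (Python, using scikit-learn's cohen_kappa_score) illustrates how collapsing annotations to a single binary toxic/normal label can report high inter-annotator agreement while a per-label, multi-label view reveals where annotators actually diverge. The label set and annotations are hypothetical examples, not drawn from the datasets studied, and the sketch is illustrative rather than a reproduction of the study's annotation pipeline.

from sklearn.metrics import cohen_kappa_score

# Hypothetical fine-grained label set, for illustration only.
LABELS = ["toxic", "obscene", "identity_attack"]

# Two annotators label the same three text samples; each annotation is the
# set of labels that annotator assigned (an empty set means "normal").
annotator_a = [{"toxic"}, {"toxic", "identity_attack"}, set()]
annotator_b = [{"toxic", "obscene"}, {"toxic", "identity_attack"}, set()]

# Single-label view: collapse every annotation to a binary toxic/normal label.
a_binary = [int(bool(ann)) for ann in annotator_a]
b_binary = [int(bool(ann)) for ann in annotator_b]
print("Collapsed binary kappa:", cohen_kappa_score(a_binary, b_binary))

# Multi-label view: agreement is computed per label, which exposes
# disagreement (here on "obscene") that the collapsed view hides.
for label in LABELS:
    a = [int(label in ann) for ann in annotator_a]
    b = [int(label in ann) for ann in annotator_b]
    print(f"Kappa for '{label}':", cohen_kappa_score(a, b))

In this toy example the collapsed binary labels show perfect agreement (kappa = 1.0), while the per-label view surfaces disagreement on "obscene" (kappa = 0.0), mirroring the observation that a multi-label scheme may lower headline agreement scores even as it records more of the annotators' perspectives.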
