The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality

Machine learning classifiers for human-facing tasks such as comment toxicity and misinformation detection often score highly on metrics such as ROC AUC but are received poorly in practice. Why this gap? Today, metrics such as ROC AUC, precision, and recall are used to measure technical performance; however, human-computer interaction research observes that evaluation of human-facing systems should account for people’s reactions to the system. In this paper, we introduce a transformation that more closely aligns machine learning classification metrics with the values and methods of user-facing performance measures. The disagreement deconvolution takes in any multi-annotator (e.g., crowdsourced) dataset, disentangles stable opinions from noise by estimating intra-annotator consistency, and compares each test set prediction to the individual stable opinions from each annotator. Applying the disagreement deconvolution to existing social computing datasets, we find that current metrics dramatically overstate the performance of many human-facing machine learning tasks: for example, performance on a comment toxicity task is corrected from .95 to .73 ROC AUC.
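
The abstract describes the pipeline only at a high level. As a rough illustration of the general idea (not the paper's exact procedure), the sketch below assumes a binary labeling task, a single global annotator flip rate estimated from a re-annotated subset of items, and a symmetric noise model; all names (`estimate_flip_rate`, `deconvolve_positive_rate`, `deconvolved_accuracy`, `p_flip`) are our own illustrative choices rather than the paper's.

```python
import numpy as np

def estimate_flip_rate(first_pass, second_pass):
    """Estimate intra-annotator noise as the fraction of re-annotated items
    on which the same annotator changed their binary label."""
    first_pass = np.asarray(first_pass)
    second_pass = np.asarray(second_pass)
    return float(np.mean(first_pass != second_pass))

def deconvolve_positive_rate(observed_rate, p_flip):
    """Correct an item's observed positive-label rate for annotator noise.

    Under a symmetric noise model, observed = stable*(1-p) + (1-stable)*p,
    so stable = (observed - p) / (1 - 2p), clipped to [0, 1].
    """
    if p_flip >= 0.5:
        raise ValueError("flip rate must be below 0.5 for the correction to be identifiable")
    stable = (observed_rate - p_flip) / (1.0 - 2.0 * p_flip)
    return float(np.clip(stable, 0.0, 1.0))

def deconvolved_accuracy(item_labels, predictions, p_flip):
    """Score each binary prediction against the estimated distribution of
    stable (noise-corrected) annotator opinions, then average over items.

    item_labels: list of per-item sequences of 0/1 annotator labels.
    predictions: sequence of 0/1 classifier predictions, one per item.
    """
    scores = []
    for labels, pred in zip(item_labels, predictions):
        observed_rate = float(np.mean(labels))
        stable_rate = deconvolve_positive_rate(observed_rate, p_flip)
        # Probability that a randomly drawn stable opinion agrees with the prediction.
        scores.append(stable_rate if pred == 1 else 1.0 - stable_rate)
    return float(np.mean(scores))

# Hypothetical usage: three items, each labeled by five annotators,
# plus a small re-annotated subset used to estimate the flip rate.
item_labels = [[1, 1, 0, 1, 1], [0, 0, 1, 0, 0], [1, 0, 1, 0, 1]]
predictions = [1, 0, 1]
p_flip = estimate_flip_rate([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(deconvolved_accuracy(item_labels, predictions, p_flip))
```

The same aggregation could be applied to other primitive metrics (e.g., precision or recall) by replacing the per-item agreement score; the paper's full method additionally handles per-item estimation details that this sketch elides.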
