The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality

Machine learning classifiers for human-facing tasks such as comment toxicity and misinformation detection often score highly on metrics such as ROC AUC but are received poorly in practice. Why this gap? Today, metrics such as ROC AUC, precision, and recall are used to measure technical performance; however, human-computer interaction research observes that evaluation of human-facing systems should account for people’s reactions to the system. In this paper, we introduce a transformation that more closely aligns machine learning classification metrics with the values and methods of user-facing performance measures. The disagreement deconvolution takes in any multi-annotator (e.g., crowdsourced) dataset, disentangles stable opinions from noise by estimating intra-annotator consistency, and compares each test set prediction to the individual stable opinions from each annotator. Applying the disagreement deconvolution to existing social computing datasets, we find that current metrics dramatically overstate the performance of many human-facing machine learning tasks: for example, performance on a comment toxicity task is corrected from .95 to .73 ROC AUC.
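
The abstract describes the pipeline only at a high level. As a rough illustration of the general idea (not the paper's exact procedure), the sketch below assumes a binary labeling task, a single global annotator flip rate estimated from a re-annotated subset of items, and a symmetric noise model; all names (`estimate_flip_rate`, `deconvolve_positive_rate`, `deconvolved_accuracy`, `p_flip`) are our own illustrative choices rather than the paper's.

```python
import numpy as np

def estimate_flip_rate(first_pass, second_pass):
    """Estimate intra-annotator noise as the fraction of re-annotated items
    on which the same annotator changed their binary label."""
    first_pass = np.asarray(first_pass)
    second_pass = np.asarray(second_pass)
    return float(np.mean(first_pass != second_pass))

def deconvolve_positive_rate(observed_rate, p_flip):
    """Correct an item's observed positive-label rate for annotator noise.

    Under a symmetric noise model, observed = stable*(1-p) + (1-stable)*p,
    so stable = (observed - p) / (1 - 2p), clipped to [0, 1].
    """
    if p_flip >= 0.5:
        raise ValueError("flip rate must be below 0.5 for the correction to be identifiable")
    stable = (observed_rate - p_flip) / (1.0 - 2.0 * p_flip)
    return float(np.clip(stable, 0.0, 1.0))

def deconvolved_accuracy(item_labels, predictions, p_flip):
    """Score each binary prediction against the estimated distribution of
    stable (noise-corrected) annotator opinions, then average over items.

    item_labels: list of per-item sequences of 0/1 annotator labels.
    predictions: sequence of 0/1 classifier predictions, one per item.
    """
    scores = []
    for labels, pred in zip(item_labels, predictions):
        observed_rate = float(np.mean(labels))
        stable_rate = deconvolve_positive_rate(observed_rate, p_flip)
        # Probability that a randomly drawn stable opinion agrees with the prediction.
        scores.append(stable_rate if pred == 1 else 1.0 - stable_rate)
    return float(np.mean(scores))

# Hypothetical usage: three items, each labeled by five annotators,
# plus a small re-annotated subset used to estimate the flip rate.
item_labels = [[1, 1, 0, 1, 1], [0, 0, 1, 0, 0], [1, 0, 1, 0, 1]]
predictions = [1, 0, 1]
p_flip = estimate_flip_rate([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(deconvolved_accuracy(item_labels, predictions, p_flip))
```

The same aggregation could be applied to other primitive metrics (e.g., precision or recall) by replacing the per-item agreement score; the paper's full method additionally handles per-item estimation details that this sketch elides.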
