The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality
Michael S. Bernstein | Kayur Patel | Tatsunori B. Hashimoto | Kaitlyn Zhou | Mitchell L. Gordon
[1] Barry Bayus,et al. Crowdsourcing in medical research: concepts and applications , 2019, PeerJ.
[2] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[3] Jeremy P. Birnholtz,et al. How People Form Folk Theories of Social Media Feeds and What it Means for How We Study Self-Presentation , 2018, CHI.
[4] Ellie Pavlick,et al. Inherent Disagreements in Human Textual Inferences , 2019, Transactions of the Association for Computational Linguistics.
[5] Juho Kim,et al. Efficient Elicitation Approaches to Estimate Collective Crowd Answers , 2019, Proc. ACM Hum. Comput. Interact.
[6] Chris Callison-Burch,et al. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk , 2009, EMNLP.
[7] Lora Aroyo,et al. Crowdsourcing Ground Truth for Medical Relation Extraction , 2017, ACM Trans. Interact. Intell. Syst.
[8] Jerry Alan Fails,et al. Interactive machine learning , 2003, IUI '03.
[9] Mausam,et al. Sprout: Crowd-Powered Task Design for Crowdsourcing , 2018, UIST.
[10] V. K. Chaithanya Manam,et al. WingIt: Efficient Refinement of Unclear Task Instructions , 2018, HCOMP.
[11] R. Caplan,et al. Tiered Governance and Demonetization: The Shifting Terms of Labor and Compensation in the Platform Economy , 2020, Social Media + Society.
[12] Lora Aroyo,et al. Capturing Ambiguity in Crowdsourcing Frame Disambiguation , 2018, HCOMP.
[13] Maya Cakmak,et al. Power to the People: The Role of Humans in Interactive Machine Learning , 2014, AI Mag.
[14] Emily Mower Provost,et al. Predicting the distribution of emotion perception: capturing inter-rater variability , 2017, ICMI.
[15] Michael S. Bernstein,et al. Street-Level Algorithms: A Theory at the Gaps Between Policy and Decisions , 2019, CHI.
[16] Ralf Krestel,et al. Challenges for Toxic Comment Classification: An In-Depth Error Analysis , 2018, ALW.
[17] Jing Ma,et al. Learning from Crowds by Modeling Common Confusions , 2021, AAAI.
[18] Derek Ruths,et al. Sentiment Analysis: It’s Complicated! , 2018, NAACL.
[19] Anca Dumitrache. Crowdsourcing Disagreement for Collecting Semantic Annotation , 2015, ESWC.
[20] Lucy Vasserman,et al. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification , 2019, WWW.
[21] B. Bailar. Recent Research in Reinterview Procedures , 1968 .
[22] The Role of Source and Expressive Responding in Political News Evaluation , 2018 .
[23] Amy X. Zhang,et al. Investigating Differences in Crowdsourced News Credibility Assessment , 2020, Proc. ACM Hum. Comput. Interact.
[24] Michael Veale,et al. Like Trainer, Like Bot? Inheritance of Bias in Algorithmic Content Moderation , 2017, SocInfo.
[25] Louis Guttman,et al. The test-retest reliability of qualitative data , 1946, Psychometrika.
[26] K. Karahalios,et al. "I always assumed that I wasn't really that close to [her]": Reasoning about Invisible Algorithms in News Feeds , 2015, CHI.
[27] Yang Li,et al. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes , 2007, UIST.
[28] James A. Landay,et al. Examining Difficulties Software Developers Encounter in the Adoption of Statistical Machine Learning , 2008, AAAI.
[29] Björn Ross,et al. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis , 2016, ArXiv.
[30] Eric Gilbert,et al. CREDBANK: A Large-Scale Social Media Corpus With Associated Credibility Annotations , 2015, ICWSM.
[31] Harald C. Gall,et al. Software Engineering for Machine Learning: A Case Study , 2019, 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
[32] Tong Liu,et al. Learning to Predict Population-Level Label Distributions , 2019, WWW.
[33] Mehmet Fatih Çömlekçi. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions that Shape Social Media , 2019 .
[34] Kathleen F. McCoy,et al. User Interaction with Word Prediction: The Effects of Prediction Quality , 2009, TACC.
[36] Eyke Hüllermeier,et al. Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains , 2010, ICML.
[37] Bernard J. Jansen,et al. Online Hate Interpretation Varies by Country, But More by Individual: A Statistical Analysis Using Crowdsourced Ratings , 2018, 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS).
[38] Paul N. Bennett,et al. Guidelines for Human-AI Interaction , 2019, CHI.
[39] Ece Kamar,et al. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets , 2017, CHI.
[40] Reza Zafarani,et al. Fake News: A Survey of Research, Detection Methods, and Opportunities , 2018, ArXiv.
[41] Reut Tsarfaty,et al. Evaluating NLP Models via Contrast Sets , 2020, ArXiv.
[42] Eric Horvitz,et al. Principles of mixed-initiative user interfaces , 1999, CHI '99.
[43] Christopher Ré,et al. Snorkel: Rapid Training Data Creation with Weak Supervision , 2017, Proc. VLDB Endow.
[44] Alex Hai Wang,et al. Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach , 2010, DBSec.
[45] Karrie Karahalios,et al. Auditing Algorithms : Research Methods for Detecting Discrimination on Internet Platforms , 2014 .
[46] Munmun De Choudhury,et al. #thyghgapp: Instagram Content Moderation and Lexical Variation in Pro-Eating Disorder Communities , 2016, CSCW.
[47] Maiko Spiess. Sorting Things Out - Classification and Its Consequences , 2010 .
[48] Michael Desmond,et al. Designing Ground Truth and the Social Life of Labels , 2021, CHI.
[49] K. Junghanns,et al. Test-retest reliability and validity of the Pittsburgh Sleep Quality Index in primary insomnia , 2002, Journal of psychosomatic research.
[50] John Le,et al. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution , 2010 .
[51] Vassilis P. Plagianakos,et al. Convolutional Neural Networks for Toxic Comment Classification , 2018, SETN.
[52] Panagiotis G. Ipeirotis,et al. Get another label? improving data quality and data mining using multiple, noisy labelers , 2008, KDD.
[53] Suhang Wang,et al. Fake News Detection on Social Media: A Data Mining Perspective , 2017, SKDD.
[54] Quan Do,et al. Jigsaw Unintended Bias in Toxicity Classification , 2019 .
[55] Nicholas R. Jennings,et al. Bayesian Aggregation of Categorical Distributions with Applications in Crowdsourcing , 2017, IJCAI.
[56] Jeffrey Heer,et al. Parting Crowds: Characterizing Divergent Interpretations in Crowdsourced Annotation Tasks , 2016, CSCW.
[57] Thomas L. Griffiths,et al. Human Uncertainty Makes Classification More Robust , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[58] Kristina Lerman,et al. A Survey on Bias and Fairness in Machine Learning , 2019, ACM Comput. Surv.
[59] David R. Karger,et al. Squadbox: A Tool to Combat Email Harassment Using Friendsourced Moderation , 2018, CHI.
[60] Harini Suresh,et al. A Framework for Understanding Unintended Consequences of Machine Learning , 2019, ArXiv.
[61] Vikas Sindhwani,et al. Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria , 2009, HLT-NAACL 2009.
[62] J. Overhage,et al. Sorting Things Out: Classification and Its Consequences , 2001, Annals of Internal Medicine.
[63] Timnit Gebru,et al. Lessons from archives: strategies for collecting sociocultural data in machine learning , 2019, FAT*.
[64] Louis Guttman,et al. A basis for analyzing test-retest reliability , 1945, Psychometrika.
[65] Zeerak Waseem,et al. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter , 2016, NLP+CSS@EMNLP.