Garbage in, garbage out?: do machine learning application papers in social computing report where human-labeled training data comes from?

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from arXiv and traditional publication venues that perform an ML classification task on Twitter data --- gives specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers was disclosed, and whether the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally important question of whether such data is reliable in the first place.
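As an illustration of one of the best practices discussed above, the sketch below shows Cohen's kappa, a common inter-rater reliability metric, computed for two hypothetical annotators who independently labeled the same ten items. This is not code from the paper; the function, variable names, and labels are all assumptions chosen for illustration.

# A minimal sketch (not from the paper) of computing Cohen's kappa, one common
# inter-rater reliability metric, for two annotators who independently labeled
# the same items. All names and labels below are hypothetical.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators coding whether ten papers report labeler details.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
rater_2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohen_kappa(rater_1, rater_2), 3))  # ~0.565

For two raters, scikit-learn's sklearn.metrics.cohen_kappa_score computes the same quantity; Krippendorff's alpha is the more general choice when there are more than two raters or missing labels.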
