Garbage in, garbage out?: do machine learning application papers in social computing report where human-labeled training data comes from?

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from arXiv and traditional publication venues that perform an ML classification task on Twitter data --- gives specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers was disclosed, and whether the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally important question of whether such data is reliable in the first place.
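As an illustration of one of the best practices discussed above, the sketch below shows Cohen's kappa, a common inter-rater reliability metric, computed for two hypothetical annotators who independently labeled the same ten items. This is not code from the paper; the function, variable names, and labels are all assumptions chosen for illustration.

# A minimal sketch (not from the paper) of computing Cohen's kappa, one common
# inter-rater reliability metric, for two annotators who independently labeled
# the same items. All names and labels below are hypothetical.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators coding whether ten papers report labeler details.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
rater_2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohen_kappa(rater_1, rater_2), 3))  # ~0.565

For two raters, scikit-learn's sklearn.metrics.cohen_kappa_score computes the same quantity; Krippendorff's alpha is the more general choice when there are more than two raters or missing labels.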
