Human Computation Must Be Reproducible

Human computation is the technique of performing a computational process by outsourcing some of its difficult-to-automate steps to humans. In the social and behavioral sciences, where humans are used as measuring instruments, reproducibility guides the design and evaluation of experiments. We argue that human computation has similar properties, and that its results must be reproducible, at the very least, in order to be informative. We might additionally require the results of human computation to have high validity or high utility, but they must be reproducible before their validity or utility can be measured to a degree better than chance. A focus on reproducibility also has implications for the design of tasks and instructions, as well as for the communication of results. It is humbling how often the initial understanding of a task and its guidelines turns out to lack reproducibility. We suggest ensuring, measuring, and communicating the reproducibility of human computation tasks.
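As a concrete illustration of what "measuring reproducibility" can mean in practice, below is a minimal sketch that computes a chance-corrected inter-rater agreement coefficient over the labels produced by two workers on the same items. Cohen's kappa for two raters and nominal labels is used purely as an example; the choice of coefficient and the sample labels are assumptions of this sketch, and other measures (e.g., Scott's pi, Fleiss' kappa, Krippendorff's alpha) may be more appropriate depending on the task.

```python
# Minimal sketch: measuring reproducibility of a human computation task
# via a chance-corrected agreement coefficient (Cohen's kappa, two raters,
# nominal labels). Illustrative only; other coefficients may be preferable.

from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters labeling the same items."""
    assert len(labels_a) == len(labels_b), "raters must label the same items"
    n = len(labels_a)

    # Observed agreement: fraction of items on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each rater's marginal label distribution.
    dist_a = Counter(labels_a)
    dist_b = Counter(labels_b)
    p_expected = sum(
        (dist_a[label] / n) * (dist_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )

    # kappa = 1 is perfect agreement; 0 is agreement no better than chance.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: two workers judging ten items as relevant/irrelevant.
rater_1 = ["rel", "rel", "irr", "rel", "irr", "rel", "rel", "irr", "rel", "irr"]
rater_2 = ["rel", "irr", "irr", "rel", "irr", "rel", "rel", "rel", "rel", "irr"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

In this hypothetical run the observed agreement is 0.80 while chance agreement is 0.54, giving kappa of roughly 0.57; reporting such a coefficient, rather than raw agreement alone, is one way to communicate the reproducibility of a task.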
