Identifying Promising Items: The Use of Crowdsourcing in the Development of Assessment Instruments

The psychometrically sound development of assessment instruments requires pilot testing of candidate items as a first step in gauging their quality, typically a time-consuming and costly effort. Crowdsourcing offers the opportunity to gather data far more quickly and inexpensively than is possible with most targeted populations. In a simulation of a pilot testing protocol, item parameters for 110 life science questions are estimated from 4,043 crowdsourced adult subjects and then compared with those from 20,937 middle school science students. In terms of item discrimination classification (high vs. low), classical test theory yields an acceptable level of agreement (C-statistic = 0.755), while item response theory produces excellent results (C-statistic = 0.848). Item response theory also identifies potential anchor items without any false positives (items with low discrimination in the targeted population). We conclude that the use of crowdsourced subjects is a reasonable, efficient method for identifying high-quality items for field testing and for selecting anchor items to be used for test equating.
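The agreement measure used above, the C-statistic, is the area under the ROC curve: the probability that a randomly chosen high-discrimination item (as classified in the targeted population) receives a higher crowdsourced discrimination estimate than a randomly chosen low-discrimination item. A minimal sketch of this computation, using the Mann-Whitney pairwise-comparison form and entirely made-up illustrative numbers (the item values below are hypothetical, not data from the study):

```python
def c_statistic(scores, labels):
    """C-statistic (ROC AUC) via the Mann-Whitney relation:
    the fraction of (high, low) item pairs in which the high-
    discrimination item gets the larger pilot estimate;
    ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item in each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: crowdsourced (pilot) discrimination estimates
# for eight items, and whether each item proved high-discrimination (1)
# or low-discrimination (0) in the targeted student population.
pilot_a = [1.6, 0.4, 1.2, 0.3, 0.9, 1.4, 0.2, 0.8]
target_high = [1, 0, 1, 0, 0, 1, 0, 1]
print(round(c_statistic(pilot_a, target_high), 3))  # → 0.938
```

A C-statistic of 0.5 means the pilot estimates rank items no better than chance; 1.0 means every high-discrimination item outranks every low-discrimination one, so values like 0.848 indicate strong (though imperfect) agreement between the crowdsourced and in-population classifications.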
