Paper Information - Managing What We Can Measure: Quantifying the Susceptibility of Automated Scoring Systems to Gaming Behavior

As methods for automated scoring of constructed-response items become more widely adopted in state assessments, and are used in more consequential operational configurations, it is critical that their susceptibility to gaming behavior be investigated and managed. This article provides a review of research relevant to how construct-irrelevant response behavior may affect automated constructed-response scoring, and aims to address a gap in that literature: the need to assess the degree of risk before operational launch. A general framework is proposed for evaluating susceptibility to gaming, and an initial empirical demonstration is presented using the open-source short-answer scoring engines from the Automated Student Assessment Prize (ASAP) Challenge.

Derrick Higgins | Michael Heilman
