Contrasting State-of-the-Art in the Machine Scoring of Short-Form Constructed Responses

This study compared short-form constructed responses evaluated by both human raters and machine scoring algorithms. The context was a public competition in which both public teams and commercial vendors vied to develop machine scoring algorithms that would match or exceed the performance of operational human raters in a summative, high-stakes testing environment. Data (N = 25,683) were drawn from three different states, covered 10 different prompts, and spanned two secondary grade levels. Samples ranging in size from 2,130 to 2,999 were randomly selected from the data sets provided by the states and then randomly divided into three sets: a training set, a test set, and a validation set. Machine performance across all of the agreement measures failed to match that of the human raters. The study concluded with recommendations on steps that might improve machine scoring algorithms before they are used in any operational way.
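
To make the methodology concrete, the sketch below illustrates the kind of procedure the abstract describes: a random three-way split of the response data and one widely used human-machine agreement measure, quadratic weighted kappa. The split proportions, the 0-3 score rubric, and the choice of kappa are illustrative assumptions; the paper's exact measures and sample allocations are not specified here.

```python
# Hedged sketch: a random train/test/validation split and quadratic weighted
# kappa as one example agreement measure. Split fractions, rubric range, and
# the metric choice are assumptions for illustration, not taken from the paper.
import numpy as np


def train_test_validation_split(records, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Randomly partition records into train, test, and validation subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(records))
    n_train = int(fractions[0] * len(records))
    n_test = int(fractions[1] * len(records))
    train = [records[i] for i in idx[:n_train]]
    test = [records[i] for i in idx[n_train:n_train + n_test]]
    validation = [records[i] for i in idx[n_train + n_test:]]
    return train, test, validation


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Chance-corrected agreement between two sets of integer scores,
    penalizing larger disagreements more heavily (quadratic weights)."""
    scores = np.arange(min_score, max_score + 1)
    k = len(scores)
    # Observed joint score distribution, normalized to sum to 1
    observed = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score, b - min_score] += 1
    observed /= observed.sum()
    # Expected distribution under independence of the two raters' marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights
    weights = ((scores[:, None] - scores[None, :]) ** 2) / ((k - 1) ** 2)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Example: agreement between human and machine scores on a hypothetical 0-3 rubric
human = [2, 1, 3, 0, 2, 2, 1, 3]
machine = [2, 1, 2, 0, 3, 2, 1, 3]
print(round(quadratic_weighted_kappa(human, machine, 0, 3), 3))
```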
