Monitoring of Scoring Using the e‐rater® Automated Scoring System and Human Raters on a Writing Test

This article proposes and investigates several methodologies for monitoring the quality of constructed-response (CR) scoring, both human and automated. Interest in operational essay scoring that combines automated scoring with human raters has grown, and there is evidence of rater effects, such as scoring severity and score inconsistency among human raters. Automated scoring of CRs has recently been implemented alongside human scoring in operational programs (the TOEFL® and GRE® tests); however, much is still unknown about the performance of automated scoring systems. For quality assurance purposes, a consistent and standardized approach is therefore needed to monitor the quality of CR scoring over time and across programs. Monitoring scoring results helps provide scores that are fair and accurate for test takers and test users, enabling testing programs to detect and correct changes in scoring severity.

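One common family of techniques for detecting drift in scoring severity over time is statistical process control, such as Shewhart-style control charts applied to batches of scores. The sketch below is a minimal, hypothetical illustration under assumed data: it charts batch means of human-minus-automated score differences against 3-sigma limits estimated from a baseline period. The batch structure, threshold choice, and function names are assumptions for illustration, not the article's specific monitoring procedures.

```python
# Illustrative Shewhart-style (x-bar) control chart for monitoring scoring severity drift.
# All data and limits here are hypothetical; this is a sketch of the general technique,
# not the article's operational quality-control procedure.
import math
from statistics import mean, stdev

def baseline_parameters(baseline_diffs):
    """Center line and sigma estimated from baseline human-minus-machine score differences."""
    return mean(baseline_diffs), stdev(baseline_diffs)

def flag_batches(batches, center, sigma, k=3.0):
    """Flag scoring batches whose mean difference falls outside k-sigma limits for the batch mean."""
    flagged = []
    for label, diffs in batches:
        m = mean(diffs)
        half_width = k * sigma / math.sqrt(len(diffs))  # limits narrow with larger batches
        if abs(m - center) > half_width:
            flagged.append((label, round(m, 3)))
    return flagged

if __name__ == "__main__":
    # Hypothetical calibration-period differences (human score minus automated score).
    baseline = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 0.0]
    center, sigma = baseline_parameters(baseline)

    # Hypothetical weekly batches; week-3 drifts toward harsher human scoring.
    weekly = [
        ("week-1", [0.0, 0.1, -0.1, 0.2]),
        ("week-2", [0.1, 0.0, 0.2, -0.2]),
        ("week-3", [0.8, 0.9, 1.0, 0.7]),
    ]
    print("flagged batches:", flag_batches(weekly, center, sigma))
```

In this toy run, only the third batch is flagged, mirroring the kind of severity shift that routine monitoring is meant to surface for follow-up review.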