Toward More Substantively Meaningful Automated Essay Scoring

This study evaluated a “substantively driven” method for scoring NAEP writing assessments automatically. The study used variations of an existing commercial program, e-rater®, to compare the performance of three approaches to automated essay scoring: a brute-empirical approach, in which variables are selected and weighted solely according to statistical criteria; a hybrid approach, in which a fixed set of variables more closely tied to the characteristics of good writing was used but the weights were still statistically determined; and a substantively driven approach, in which a fixed set of variables was weighted according to the judgments of two independent committees of writing experts. The research questions concerned (1) the reproducibility of weights across writing experts, (2) the comparison of scores generated by the three automated approaches, and (3) the extent to which models developed for scoring one NAEP prompt generalize to other NAEP prompts of the same genre. Data came from the 2002 NAEP Writing Online study and from the main NAEP 2002 writing assessment. In carrying out the substantively driven approach, the two committees initially assigned highly similar weights to the writing dimensions; the weights diverged after committee 1, but not committee 2, was shown the empirical weights for possible use in its judgments. In most of the analyses conducted, the substantively driven approach based on committee 1's judgments did not operate markedly differently from the brute-empirical or hybrid approaches, whereas the approach based on committee 2's judgments showed many consistent differences from them. The results suggest that empirical weights can provide a useful starting point for expert committees, with the understanding that the weights would be adjusted only moderately to bring them more into line with substantive considerations. Under such circumstances, the results may be reasonable, though not necessarily as highly related to human ratings as those produced by statistically optimal approaches.
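To make the contrast among the three weighting schemes concrete, the sketch below illustrates how a weighted-sum scoring model of the kind e-rater uses can combine the same standardized essay features under statistically fit weights versus committee-assigned weights. It is not taken from the study; the feature names, weights, and scores are hypothetical, and the real models involve additional features and scaling steps.

```python
# Minimal sketch (hypothetical data, not the actual e-rater implementation) of
# scoring essays as a weighted linear combination of standardized features,
# comparing empirically fit weights with expert-assigned weights.
import numpy as np

# Hypothetical standardized feature values for four essays
# (columns: organization, development, word choice, conventions).
features = np.array([
    [ 0.8,  0.5,  0.3,  1.0],
    [-0.2,  0.1, -0.4,  0.3],
    [ 1.1,  0.9,  0.7,  0.2],
    [-1.0, -0.8, -0.5, -0.9],
])

# Hypothetical human ratings for the same essays.
human_scores = np.array([4.5, 3.0, 4.8, 1.5])

# "Brute-empirical" weights: least-squares fit to the human ratings.
empirical_weights, *_ = np.linalg.lstsq(features, human_scores, rcond=None)

# "Substantively driven" weights: fixed in advance by an expert committee
# (hypothetical values emphasizing development over conventions).
expert_weights = np.array([0.30, 0.35, 0.20, 0.15])

def score(x, w):
    """Weighted linear combination of feature values."""
    return x @ w

for i, essay in enumerate(features):
    print(f"essay {i}: empirical={score(essay, empirical_weights):.2f}, "
          f"expert={score(essay, expert_weights):.2f}")
```

In practice, expert-weighted scores would still be rescaled to the reporting metric; the point of the sketch is only that the two approaches differ in how the weight vector is obtained, not in how the final score is computed.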
