This study evaluated expert system diagnoses of examinees' solutions to complex constructed-response algebra word problems. Problems were presented to three samples, each of which had taken the GRE General Test. One sample took the problems in paper-and-pencil form and the other two on computer. Responses were then diagnostically analyzed by an expert system, GIDE, and by four ETS mathematics test developers using a fine-grained categorization of error types. Results were highly consistent across the samples. Human judges agreed among themselves almost perfectly in describing responses as right or wrong but concurred at much lower levels (37% to 64% agreement) in categorizing the specific bugs they detected in incorrect solutions. The expert system agreed highly with the judges' right/wrong decisions (95% to 97% concurrence) and somewhat less closely (71% to 74%) with the bug categorizations that the judges themselves agreed on. Seven principal causes of machine-rater disagreement were identified, most of which could be remedied by making adjustments to GIDE, modifying the test presentation interface to constrain the form of examinee solutions, and working with test developers to specify rules for automatically handling special cases. These results suggest that highly accurate diagnostic analysis through knowledge-based understanding of complex responses may be difficult to achieve at the fine-grained level used by GIDE. The accuracy of qualitative judgments might be increased by using a smaller set of more general diagnostic categories and by integrating information from other sources, including performance on diverse item types.
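The concurrence figures above are consistent with a simple percent-agreement comparison between the expert system's classifications and each judge's. The abstract does not state the exact statistic used, so the sketch below assumes plain percent agreement and uses hypothetical data purely for illustration.

```python
# Minimal sketch (not from the paper): computing simple percent agreement
# between the expert system's right/wrong decisions and a human judge's.
# The labels and data below are hypothetical placeholders.

def percent_agreement(machine_labels, human_labels):
    """Return the percentage of responses on which the two raters assign the same label."""
    if len(machine_labels) != len(human_labels):
        raise ValueError("Both raters must score the same set of responses.")
    matches = sum(m == h for m, h in zip(machine_labels, human_labels))
    return 100.0 * matches / len(machine_labels)

# Hypothetical example: 1 = scored correct, 0 = scored incorrect.
machine = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
judge   = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
print(f"Agreement: {percent_agreement(machine, judge):.0f}%")  # prints "Agreement: 90%"
```

The same comparison could in principle be applied to the bug-category labels on incorrect solutions, where the reported agreement was lower.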