Evaluating Computer Automated Scoring: Issues, Methods, and an Empirical Illustration

With the continuing advance of computer technology, computer automated scoring (CAS) has become a popular tool for evaluating writing assessments, but research on applying these methodologies to new types of performance assessments is still emerging. While research has generally shown high agreement between CAS-generated scores and those produced by human raters, concerns have been raised about appropriate analyses and about the validity of decisions and interpretations based on those scores. In this paper we extend the emerging discussion of validation strategies for CAS by illustrating several analyses that can be accomplished with available data. These analyses compare the degree to which two CAS systems accurately score data from a structured interview, using the original scores provided by human raters as the criterion. Results suggest key differences between the two systems, as well as differences among the statistical procedures used to evaluate them. The use of multiple statistical and qualitative analyses is recommended for evaluating contemporary CAS systems.
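
The paper itself presents no code; as a minimal sketch of the kind of rater-agreement analysis it describes, the Python below computes exact agreement and Cohen's kappa between human-rater and CAS-generated scores. The 1-4 score scale, the data, and the function names are hypothetical illustrations, not the authors' actual procedure.

```python
# A minimal sketch (not from the paper) of agreement statistics between
# human and CAS-generated scores: exact agreement and Cohen's kappa.
from collections import Counter

def exact_agreement(human, machine):
    """Proportion of cases where the two sets of scores match exactly."""
    matches = sum(h == m for h, m in zip(human, machine))
    return matches / len(human)

def cohens_kappa(human, machine):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = exact_agreement(human, machine)
    # Expected agreement under independence of the two raters' marginals.
    h_marg = Counter(human)
    m_marg = Counter(machine)
    categories = set(h_marg) | set(m_marg)
    p_e = sum((h_marg[c] / n) * (m_marg[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical scores on a 1-4 rating scale for ten interview responses.
human_scores   = [3, 2, 4, 1, 3, 2, 4, 3, 2, 1]
machine_scores = [3, 2, 3, 1, 3, 2, 4, 2, 2, 1]

print(f"exact agreement: {exact_agreement(human_scores, machine_scores):.2f}")
print(f"Cohen's kappa:   {cohens_kappa(human_scores, machine_scores):.2f}")
```

On this illustrative data the two statistics diverge (0.80 exact agreement versus roughly 0.73 kappa), which is why the paper recommends reporting several indices rather than relying on raw agreement alone.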
