Improving the Statistical Aspects of e-rater

This study explores alternative ways of reducing the number of variables (features) and additional ways of combining information across features to produce more stable and accurate e-rater scores. An explanation of the statistical aspects of the current process is followed by a description of alternatives to it. Our explorations led to several conclusions and directions for future research. We have examined enough e-rater data to conclude that stepwise regression seems to be effective as a feature reduction procedure. However, this effectiveness may be attributable to the consistently strong relationship with essay score that is observed for the content vector analysis (CVA) variables and for the two variables used to approximate word length (the number of auxiliary verbs and the ratio of the number of auxiliary verbs to the number of words). To yield better validation results, we also suggest that the hold-out method for evaluating validity replace the current two-stage approach of first developing a model in a quasi-uniform training sample and then validating the results in a target cross-validation sample. More research is needed in several areas. First, explicit modeling of the part of essay scores that is unrelated to word length is warranted. Second, the proportional odds model (POM) approach should be investigated in greater depth. Third, a statistical justification is needed for using essay scores to score CVA variables. Fourth, algorithmic approaches to the prediction/classification problem, such as boosting, may prove fruitful. Finally, quantile regression and ridge regression should be investigated further.
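As a rough illustration of two of the ideas named above, the following sketch fits a ridge regression and evaluates it with a hold-out scheme. It rests on several assumptions that are not taken from the study: "hold-out" is read here as leaving out one essay at a time, the features and scores are synthetic, the 1 to 6 score scale is assumed, and the penalty value `lam` is arbitrary. The function names are hypothetical, and none of this should be read as e-rater's actual model.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # center features so the intercept is not shrunk
    yc = y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta


def leave_one_out_predictions(X, y, lam):
    """Predict each essay's score from a model fitted without that essay."""
    n = X.shape[0]
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        intercept, beta = ridge_fit(X[keep], y[keep], lam)
        preds[i] = intercept + X[i] @ beta
    return preds


# Synthetic stand-in data (assumption): 200 essays, 5 hypothetical features,
# human scores on an assumed 1-6 integer scale.
rng = np.random.default_rng(0)
n_essays, n_features = 200, 5
X = rng.normal(size=(n_essays, n_features))
true_beta = np.array([0.8, 0.5, 0.0, 0.0, 0.3])
raw = 3.5 + X @ true_beta + rng.normal(scale=0.7, size=n_essays)
y = np.clip(np.rint(raw), 1, 6)

preds = leave_one_out_predictions(X, y, lam=1.0)  # lam chosen arbitrarily
rounded = np.clip(np.rint(preds), 1, 6)

# Agreement between rounded predictions and the synthetic "human" scores.
exact_agreement = float(np.mean(rounded == y))
adjacent_agreement = float(np.mean(np.abs(rounded - y) <= 1))
print(f"exact agreement:    {exact_agreement:.3f}")
print(f"adjacent agreement: {adjacent_agreement:.3f}")
```

Because every prediction comes from a model that never saw the essay being scored, the agreement rates are not inflated by fitting and evaluating on the same cases. Setting `lam` to 0 reduces `ridge_fit` to ordinary least squares, which makes it easy to compare shrunken and unshrunken weights on the same data.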
