Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation

Many recent Short Answer Scoring (SAS) systems have adopted Quadratic Weighted Kappa (QWK) as their evaluation measure. However, we hypothesize that QWK is unsatisfactory for evaluating SAS systems when we consider their effectiveness in actual use. We therefore introduce a new task formulation of SAS that matches actual usage: an SAS system should produce as many scoring predictions as possible that are free of critical scoring errors (CSEs). We conduct experiments under this new formulation and demonstrate that a typical SAS system can predict scores with zero CSEs for up to approximately 50% of the test data by filtering out low-reliability predictions on the basis of confidence estimation. This result directly indicates the possibility of halving the scoring cost of human raters, a benefit that our formulation captures and QWK does not, making it a more suitable basis for evaluating SAS systems.
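
As a rough illustration of the filtering step described above, the sketch below uses the maximum softmax probability over score levels as the confidence estimate and treats any prediction deviating from the human score by more than one point as a CSE. Both of these choices, the 0.7 threshold, and all variable names are illustrative assumptions, not necessarily the paper's exact setup.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def filter_by_confidence(confidences, predictions, threshold):
        """Keep only predictions whose confidence meets the threshold;
        the rest would be deferred to human raters."""
        mask = confidences >= threshold
        return predictions[mask], mask

    def critical_error_rate(predictions, gold, max_diff=1):
        """Assumed CSE definition: the predicted score deviates from the
        human score by more than `max_diff` points."""
        errors = np.abs(predictions - gold) > max_diff
        return errors.mean() if len(errors) else 0.0

    # Toy example: model outputs class probabilities over score levels 0..3.
    probs = np.array([[0.05, 0.05, 0.10, 0.80],
                      [0.30, 0.30, 0.25, 0.15],
                      [0.02, 0.90, 0.05, 0.03]])
    gold = np.array([3, 0, 1])

    predictions = probs.argmax(axis=1)
    confidences = probs.max(axis=1)  # maximum softmax probability as confidence

    kept, mask = filter_by_confidence(confidences, predictions, threshold=0.7)
    coverage = mask.mean()  # fraction of answers scored automatically

    # Report coverage and CSE rate on the retained predictions, alongside
    # QWK for comparison with the conventional evaluation measure.
    cse = critical_error_rate(kept, gold[mask])
    qwk = cohen_kappa_score(gold[mask], kept, weights="quadratic")
    print(f"coverage={coverage:.2f}, CSE rate on kept={cse:.2f}, QWK={qwk:.2f}")

Under this scheme, raising the threshold trades coverage (how much human scoring effort is saved) against the CSE rate on the automatically scored portion, which is the trade-off the proposed task formulation evaluates directly.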
