Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring

Short answer scoring (SAS) is the task of grading short texts written by learners. In recent years, deep-learning-based approaches have substantially improved the performance of SAS models, but guaranteeing high-quality predictions remains a critical issue when applying such models in education. Toward guaranteeing high-quality predictions, we present the first study exploring a human-in-the-loop framework that minimizes grading cost while guaranteeing grading quality by allowing a SAS model to share the grading task with a human grader. Specifically, by introducing a confidence estimation method that indicates the reliability of the model's predictions, one can guarantee scoring quality by accepting only predictions with high reliability as scoring results and deferring predictions with low reliability to human graders. In our experiments, we investigate the feasibility of the proposed framework using multiple confidence estimation methods and multiple SAS datasets. We find that our human-in-the-loop framework allows automatic scoring models and human graders to jointly achieve a target scoring quality.
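
As a concrete illustration of the confidence-based routing described above, the following is a minimal Python sketch. The names predict, human_grade, and threshold are hypothetical interfaces assumed for illustration; this is not the implementation or the confidence estimators used in the paper.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ScoredAnswer:
    text: str
    score: int          # final score assigned to this answer
    auto_scored: bool   # True if the SAS model's prediction was accepted

def score_with_human_in_the_loop(
    answers: List[str],
    predict: Callable[[str], Tuple[int, float]],  # hypothetical: returns (score, confidence)
    human_grade: Callable[[str], int],            # hypothetical: a human grader's judgment
    threshold: float,                             # confidence required for automatic scoring
) -> List[ScoredAnswer]:
    # Accept model predictions whose confidence meets the threshold;
    # defer all remaining answers to the human grader.
    results: List[ScoredAnswer] = []
    for text in answers:
        score, confidence = predict(text)
        if confidence >= threshold:
            results.append(ScoredAnswer(text, score, auto_scored=True))
        else:
            results.append(ScoredAnswer(text, human_grade(text), auto_scored=False))
    return results

In such a setup, the threshold would be tuned on held-out data so that the automatically scored subset meets the target scoring quality, which in turn determines how much grading cost is shifted away from human graders.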
