Computer-Assisted Scoring of Short Responses: The Efficiency of a Clustering-Based Approach in a Real-Life Task

We present an extrinsic evaluation of a clustering-based approach to computer-assisted scoring of short constructed-response items, as encountered in educational assessment. Due to their open-ended nature, constructed-response items must be graded by human readers, which makes the overall testing process costly and time-consuming. In this paper we investigate the prospects of streamlining the grading task by grouping similar responses for scoring. We compare the efficiency of scoring clustered responses both with the traditional mode of grading individual test-takers' sheets and with by-item scoring of non-clustered responses. The three grading modes are evaluated during real-life proficiency tests of German as a Foreign Language. We show that a system based on basic clustering techniques and shallow features yields a promising reduction in grading time and performs as well as a system that displays test-taker sheets for scoring.
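To illustrate the general idea of grouping similar responses with shallow features, the sketch below clusters short answers greedily by bag-of-words cosine similarity. This is a minimal, hypothetical illustration, assuming a token-overlap representation and a similarity threshold; it is not the paper's actual pipeline, whose features and clustering method are described in the full text.

```python
import math
from collections import Counter

def vectorize(text):
    # Shallow features: a lowercase bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def cluster_responses(responses, threshold=0.5):
    """Greedy single-pass clustering: attach each response to the first
    cluster whose seed (first member) is similar enough, else open a
    new cluster. Returns a list of clusters as lists of indices."""
    clusters = []
    for i, resp in enumerate(responses):
        vec = vectorize(resp)
        for cluster in clusters:
            if cosine(vec, vectorize(responses[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

answers = [
    "the capital of france is paris",
    "paris is the capital of france",
    "i do not know",
    "france's capital is paris",
]
groups = cluster_responses(answers, threshold=0.4)
# groups -> [[0, 1, 3], [2]]: a rater can now score each group once
# instead of reading every sheet individually.
```

A rater then assigns one score per cluster (with the option to split off misclustered members), which is where the time savings over per-sheet grading come from.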
