Building a Textual Entailment Suite for the Evaluation of Automatic Content Scoring Technologies

Automatic content scoring for free-text responses has started to emerge as an application of Natural Language Processing in its own right, much like question answering or machine translation. In general, the task reduces to comparing a student's answer to a model answer. Although a considerable amount of work has been done, common benchmarks and evaluation measures for this application do not currently exist, so it is not yet possible to compare systems or track progress on an application that we view as a textual entailment task. This paper introduces a test suite built at Educational Testing Service that takes a step towards establishing such a benchmark. The suite can be used for regression and performance evaluation, both within c-rater and across automatic content scoring technologies. Existing textual entailment test suites such as PASCAL RTE or FraCaS, though beneficial, are not suitable for our purposes, since we deal with atypical, naturally occurring student responses that need to be categorized in order to serve as regression test cases.
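
To make the intended use concrete, the sketch below shows one hypothetical way such entailment-style regression cases could be represented and scored against a content scoring engine. The class and function names, field names, and category labels (`EntailmentTestCase`, `run_regression`, `scorer`, "paraphrase", etc.) are illustrative assumptions for this sketch, not the actual schema or API of the ETS suite or of c-rater.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EntailmentTestCase:
    """One regression case: does the student response entail the model answer?
    All fields are illustrative; the actual suite's schema may differ."""
    item_id: str
    model_answer: str          # the "hypothesis" (expected concept)
    student_response: str      # the "text" (naturally occurring student answer)
    category: str              # e.g. "paraphrase", "negation", "ungrammatical"
    expected_entailment: bool  # gold label: should this response receive credit?

def run_regression(cases: list[EntailmentTestCase],
                   scorer: Callable[[str, str], bool]) -> dict[str, float]:
    """Run a scoring engine over the suite and report accuracy per category.

    `scorer(student_response, model_answer)` stands in for any content
    scoring system wrapped to return True when it judges the response to
    entail the model answer.
    """
    totals: dict[str, list[int]] = {}
    for case in cases:
        predicted = scorer(case.student_response, case.model_answer)
        stats = totals.setdefault(case.category, [0, 0])
        stats[0] += int(predicted == case.expected_entailment)
        stats[1] += 1
    return {cat: hits / seen for cat, (hits, seen) in totals.items()}
```

Reporting accuracy per response category, rather than a single overall number, is what would let such a suite localize regressions to particular linguistic phenomena when comparing one version of an engine against another, or one scoring technology against a competitor.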
