Improving Short Answer Grading Using Transformer-Based Pre-training

Dialogue-based tutoring platforms have shown great promise in helping individual students improve mastery. Short answer grading is a crucial component of such platforms. However, grading short answers on the same platform across diverse disciplines and titles remains a key challenge, owing to variations in data distribution across domains and the frequent occurrence of non-sentential answers. Recent NLP research has introduced novel deep learning architectures such as the Transformer, which relies solely on self-attention mechanisms. Pre-trained models based on the Transformer architecture have produced impressive results across a range of NLP tasks. In this work, we experiment with fine-tuning a pre-trained self-attention language model, namely Bidirectional Encoder Representations from Transformers (BERT), for short answer grading, and show that it produces superior results across multiple domains. On the SemEval-2013 benchmark dataset, we report up to 10% absolute improvement in macro-average F1 over state-of-the-art results. On our two psychology-domain datasets, the fine-tuned model yields classification performance close to human-agreement levels. Moreover, we study the effectiveness of fine-tuning as a function of the size of the task-specific labeled data and the number of training epochs, and its generalizability to cross-domain and joint-domain scenarios.
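As background for the self-attention mechanism the abstract refers to, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation of the Transformer. All dimensions, weights, and names here are illustrative assumptions, not taken from the paper or any specific BERT implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n_tokens, d_model) token representations.
    Wq, Wk, Wv: (d_model, d_k) projection matrices.
    Returns the attended output and the attention weight matrix.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Each token attends to every token, including itself.
    scores = Q @ K.T / np.sqrt(d_k)      # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V, weights

# Toy example with random inputs and weights.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # (4, 8): one attended vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

In BERT-style fine-tuning for short answer grading, many such attention layers are stacked, the reference answer and the student answer are typically encoded as a sentence pair, and a classification head is trained on top of the pooled output.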
