Language Models and Automated Essay Scoring

In this paper, we present a comparative study of automated essay scoring (AES). We apply current state-of-the-art natural language processing (NLP) neural network architectures to achieve above-human-level accuracy on the publicly available Kaggle AES dataset. We compare two powerful language models, BERT and XLNet, describing their layers and network architectures with clear notation and diagrams, and explain the advantages of transformer architectures over traditional recurrent neural network (RNN) architectures. Linear algebra notation is used to clarify the functions of transformers and attention mechanisms. Finally, we compare the results against more traditional baselines, such as bag-of-words (BOW) models and long short-term memory (LSTM) networks.
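As a concrete illustration of this notation, the scaled dot-product attention at the heart of both BERT and XLNet can be written as follows (this is the standard formulation of Vaswani et al.; the exact notation used in the body of the paper may differ):

\[
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\]

where the rows of $Q$, $K$, and $V$ are the query, key, and value vectors obtained as learned linear projections of the token embeddings, and $d_k$ is the dimension of the key vectors. Because every token attends to every other token in a single matrix operation, this mechanism avoids the sequential bottleneck and vanishing-gradient issues of recurrent architectures.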
