Comparing Transformers and RNNs on predicting human sentence processing data

Recurrent neural networks (RNNs) have long been an architecture of interest for computational models of human sentence processing. The more recently introduced Transformer architecture has been shown to outperform recurrent neural networks on many natural language processing tasks, but little is known about its ability to model human language processing. It has long been thought that human sentence reading involves something akin to recurrence, so RNNs may still hold an advantage over Transformers as cognitive models. In this paper we train both Transformer- and RNN-based language models and compare their performance as models of human sentence processing. We use the trained language models to compute surprisal values for the stimuli used in several reading experiments and use linear mixed-effects modelling to measure how well surprisal explains measures of human reading effort. Our analysis shows that the Transformers outperform the RNNs as cognitive models in explaining self-paced reading times and N400 strength, but not gaze durations from an eye-tracking experiment.
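For concreteness, the surprisal of a word given its preceding context is defined as

$\mathrm{surprisal}(w_t) = -\log_2 P(w_t \mid w_1, \dots, w_{t-1})$,

where the probability comes from the language model under evaluation. The sketch below illustrates, under stated assumptions, how per-token surprisal might be computed from a trained language model and regressed against reading measures with a linear mixed-effects model. The pretrained GPT-2 checkpoint, the data file, and the column names are illustrative assumptions and not the models, stimuli, or analysis used in the paper; for word-level analyses, subword surprisals would additionally need to be summed per word.

```python
# Hypothetical sketch: per-token surprisal from a pretrained GPT-2, followed by a
# linear mixed-effects regression of reading times on surprisal. The checkpoint,
# CSV file, and column names are assumptions for illustration only.
import torch
import pandas as pd
import statsmodels.formula.api as smf
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(sentence: str) -> list[float]:
    """Return per-token surprisal, -log2 P(w_t | w_<t), for one sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its left context (the first token has no context).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_logp = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    # Convert from nats to bits.
    return (-token_logp / torch.log(torch.tensor(2.0))).tolist()

# Assumed data frame: one row per word with a reading measure and control predictors,
# e.g. columns rt, surprisal, word_length, log_frequency, subject, item.
df = pd.read_csv("reading_times.csv")
mixed = smf.mixedlm("rt ~ surprisal + word_length + log_frequency",
                    df, groups=df["subject"]).fit()
print(mixed.summary())
```

The same surprisal values could be entered as predictors of other dependent measures (gaze duration, N400 amplitude), with model comparison indicating how much variance each language model's surprisal explains beyond the control predictors.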
