Predictive power of word surprisal for reading times is a linear function of language model quality

In human sentence processing, a word's probability in context is known to have large effects on how long the word takes to read. This relationship has been quantified using information-theoretic surprisal, the amount of new information conveyed by a word. Here, we compare surprisals derived from a collection of language models based on n-grams, neural networks, and a combination of both. We show that the models' psychological predictive power improves as a tight linear function of language model linguistic quality. We also show that the size of the effect of surprisal is estimated consistently across all types of language models. These findings point toward a surprising robustness of surprisal estimates and suggest that surprisals estimated by low-quality language models are not biased.
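To make the quantity under discussion concrete: surprisal is -log2 P(word | context), measured in bits. The paper derives these probabilities from n-gram and neural language models; the sketch below is only a minimal illustration using a toy bigram model with add-one smoothing (the function names and smoothing choice are this sketch's assumptions, not the paper's method).

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count unigrams and bigrams from a tokenized corpus.

    `corpus` is a list of sentences, each a list of word strings.
    """
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in corpus:
        tokens = ["<s>"] + sent  # sentence-initial context marker
        unigrams.update(tokens)
        for prev, curr in zip(tokens, tokens[1:]):
            bigrams[prev][curr] += 1
    return unigrams, bigrams

def surprisal(word, prev, unigrams, bigrams):
    """Surprisal in bits: -log2 P(word | prev), add-one smoothed."""
    vocab = len(unigrams)
    count = bigrams[prev][word] + 1
    total = sum(bigrams[prev].values()) + vocab
    return -math.log2(count / total)
```

Under this setup, a word that is more predictable in its context receives lower surprisal, which is the quantity the paper relates (linearly, on its link between model quality and predictive power) to reading times.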