Language models, surprisal and fantasy in Slavic intercomprehension

Abstract: In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. In receptive multilingualism, however, it is unclear to what extent predictability in context interacts with other linguistic factors when a reader tries to understand a related but unknown language, a process called intercomprehension. We distinguish two dimensions that influence processing effort during intercomprehension: surprisal in sentential context and linguistic distance. Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them with the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, we use orthographic and lexical distances, which are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models. We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that distinguishing these two measurable dimensions helps explain certain unexpected effects in human behaviour.
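The two dimensions named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy corpus, the length normalisation of the edit distance, and the add-one smoothing are all assumptions made for illustration (a real trigram model would typically be trained on a large corpus with a stronger smoothing scheme). Surprisal is computed as the negative base-2 logarithm of a word's trigram probability, and orthographic distance as Levenshtein distance divided by the length of the longer word.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the longer word's length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def train_trigrams(sentences):
    """Count trigrams and their two-word histories over tokenised sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + list(sentence) + ["</s>"]
        vocab.update(tokens)
        for i in range(2, len(tokens)):
            tri[tuple(tokens[i - 2:i + 1])] += 1
            bi[tuple(tokens[i - 2:i])] += 1
    return tri, bi, vocab


def surprisal(word, context, model):
    """Surprisal in bits of `word` after a two-word `context`, with add-one smoothing."""
    tri, bi, vocab = model
    history = tuple(context)
    p = (tri[history + (word,)] + 1) / (bi[history] + len(vocab))
    return -math.log2(p)


# Hypothetical toy corpus, for illustration only.
toy_model = train_trigrams([["to", "je", "pes"], ["to", "je", "dům"]])
print(normalized_distance("pes", "pies"))        # Czech vs. Polish word for "dog"
print(surprisal("pes", ("to", "je"), toy_model))
```

In this sketch a word seen after its context receives lower surprisal than an unseen one, while the distance measure is independent of context, which mirrors the separation of the two dimensions described above.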
