Evidence for Efficient Language Production in Chinese

Ting Qian (ting.qian@rochester.edu)
T. Florian Jaeger (fjaeger@bcs.rochester.edu)
Department of Brain and Cognitive Sciences, University of Rochester
Rochester, NY 14627 USA

Abstract

Recent work proposes that language production is organized to facilitate efficient communication by transmitting information at a constant rate. However, evidence has almost exclusively come from English. We present new results from Mandarin Chinese supporting the hypothesis that Constant Entropy Rate is observed cross-linguistically, and may be a universal property of the language production system. We show that this result holds even when several important confounds that previous work failed to address are controlled for. Finally, we present evidence that Constant Entropy Rate is observed at the syllable level as well as the word level, suggesting that the findings do not depend on the chosen unit of observation.

Keywords: constant entropy rate; efficient language production; Chinese; cross-linguistic study; information theory.

[Figure 1: The direct and indirect predictions of CER. Two panels, (a) direct prediction and (b) indirect prediction, plot information against sentence position; the legend distinguishes out-of-context information from contextualized information.]

Introduction

The idea that language can be mathematically described just like any other communication system goes back at least to Shannon (1951), who suggests that anyone speaking a language also possesses a statistical knowledge of that language. According to Shannon, this statistical knowledge enables us to use language probabilistically, as evidenced by our ability to fill in missing or incorrect letters in proofreading, to complete an unfinished phrase in a conversation, or to perform other common tasks.

Recent work has proposed that language users exploit this statistical knowledge for efficient language production (Genzel & Charniak, 2002; Aylett & Turk, 2004; Jaeger, 2006; Levy & Jaeger, 2007; van Son et al., 1998). Genzel and Charniak (2002) hypothesize that speakers use their probabilistic knowledge of language to maintain a constant entropy rate in language production. According to information theory, transmitting information at a constant rate through a noisy channel is communicatively optimal (Shannon, 1948).

If speakers follow the principle of Constant Entropy Rate (hereafter CER; Genzel & Charniak, 2002), we should observe that the sentences they produce carry, on average, the same amount of information. This direct prediction of CER is illustrated in Figure 1a. However, the direct prediction of CER is difficult to examine because it is difficult to estimate a sentence's information content in context. Current natural language processing techniques (e.g., n-gram models, probabilistic context-free grammars) only assess the a priori, or out-of-context, information of a sentence. To circumvent this problem, Genzel and Charniak tested an indirect prediction of CER: out-of-context sentence information should increase throughout a discourse. This indirect prediction of CER is illustrated in Figure 1b.

To understand why out-of-context information ought to increase throughout a discourse, one needs to consider the context-dependent nature of human communication. Utterances in a discourse build on each other. The information encoded in a string of words (or a stream of sounds) is co-determined by its context.
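This reasoning can be stated precisely with a standard information-theoretic identity; the restatement of Genzel and Charniak's (2002) argument below uses our own notation:

\[
H(X_i) \;=\; H(X_i \mid C_i) \;+\; I(X_i ;\, C_i),
\]

where $X_i$ denotes the $i$-th sentence of a discourse and $C_i = X_1, \dots, X_{i-1}$ its preceding context. CER asserts that the contextualized information $H(X_i \mid C_i)$ is constant in $i$. Because the context $C_i$ grows as the discourse unfolds, the mutual information $I(X_i ; C_i)$ should, on average, increase with $i$; the out-of-context information $H(X_i)$, which is all that n-gram models can measure, must therefore increase with sentence position.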
Intuitively, when context is not available, as with sentences randomly extracted from a well-structured discourse, content appears more surprising than it would in context. If speakers are efficient, they should thus encode less out-of-context information in sentences early in a discourse than late in a discourse, where more preceding context will (on average) lower the actual (contextualized) information content. A reverse pattern (too much information at the beginning and too little at the end) would make a discourse overwhelmingly difficult to understand at the beginning and barely informative toward the end. This is hardly efficient from the speaker's perspective, since it is likely to result in unsuccessful communication with the listener.

Corpus studies have provided evidence for the indirect prediction of CER. For articles in the Wall Street Journal corpus, Genzel and Charniak (2002) found that average out-of-context sentence information increases throughout the discourse (see also Keller, 2004). Piantadosi and Gibson (2008) found that data from spoken English also follow the prediction of CER.

In this paper, we build on these previous findings. We address certain methodological shortcomings and extend the scope of the empirical investigation of CER in three important ways. First, previous studies reported only gross correlations between sentence information and sentence position. We use a linear mixed model to analyze the relation between out-of-context information content and sentence position, while controlling for possible confounds such as potentially non-linear effects of sentence length (a sketch of such an analysis is given below). Study 1 replicates Genzel
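As an illustration of this kind of analysis, the Python sketch below estimates each sentence's average out-of-context information with an add-alpha smoothed trigram model (a simple stand-in for the back-off smoothing of Katz, 1987, that is standard in this literature) and then fits a linear mixed model with statsmodels' MixedLM. The toy `articles` corpus, the log-transformed position predictor, and all names here are our own illustrative assumptions, not the authors' actual pipeline.

import math
from collections import defaultdict

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def train_trigram(sentences):
    """Count padded trigrams and their bigram histories."""
    tri, bi, vocab = defaultdict(int), defaultdict(int), set()
    for toks in sentences:
        vocab.update(toks)
        padded = ["<s>", "<s>"] + toks + ["</s>"]
        for i in range(2, len(padded)):
            bi[tuple(padded[i - 2:i])] += 1
            tri[tuple(padded[i - 2:i + 1])] += 1
    return tri, bi, len(vocab) + 1  # + 1 for the </s> symbol


def mean_info(toks, tri, bi, v, alpha=1.0):
    """Average out-of-context information per word, in bits, under an
    add-alpha trigram model (stand-in for Katz back-off smoothing)."""
    padded = ["<s>", "<s>"] + toks + ["</s>"]
    bits = 0.0
    for i in range(2, len(padded)):
        p = (tri[tuple(padded[i - 2:i + 1])] + alpha) / (
            bi[tuple(padded[i - 2:i])] + alpha * v)
        bits -= math.log2(p)
    return bits / (len(padded) - 2)


# Toy corpus: a list of articles, each a list of tokenized sentences.
# (A real study would train the language model on held-out text.)
articles = [
    [["the", "market", "opened", "higher"],
     ["traders", "expected", "the", "rally", "to", "continue"],
     ["by", "noon", "the", "index", "had", "reversed", "its", "gains"]],
    [["rain", "fell", "all", "week"],
     ["the", "river", "rose", "past", "flood", "stage"],
     ["officials", "ordered", "an", "evacuation", "of", "the", "valley"]],
    [["the", "team", "lost", "again"],
     ["fans", "blamed", "the", "coach", "for", "the", "defeat"],
     ["by", "spring", "the", "club", "had", "hired", "a", "new", "manager"]],
]

tri, bi, v = train_trigram(s for doc in articles for s in doc)

df = pd.DataFrame(
    [{"article": a, "position": j + 1, "loglen": math.log(len(s)),
      "info": mean_info(s, tri, bi, v)}
     for a, doc in enumerate(articles) for j, s in enumerate(doc)])

# Linear mixed model: does out-of-context information increase with
# (log) sentence position once sentence length is controlled for?
# Random intercepts group sentences by article.
fit = smf.mixedlm("info ~ np.log(position) + loglen",
                  df, groups=df["article"]).fit()
print(fit.summary())

With a realistically sized corpus, a reliably positive coefficient on the position term would support the indirect prediction of CER, since it indicates rising out-of-context information even after sentence length is held constant.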
References

[1] Claude E. Shannon. Prediction and Entropy of Printed English. Bell System Technical Journal, 1951.
[2] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1987.
[3] R. J. J. H. van Son, Florien J. Koopmans-van Beinum, et al. Efficiency as an organizing principle of natural speech. ICSLP, 1998.
[4] Dmitriy Genzel and Eugene Charniak. Entropy Rate Constancy in Text. ACL, 2002.
[5] Dmitriy Genzel and Eugene Charniak. Variation of Entropy and Parse Trees of Sentences as a Function of the Sentence Number. EMNLP, 2003.
[6] Frank Keller. The Entropy Rate Principle as a Predictor of Processing Effort: An Evaluation against Eye-tracking Data. EMNLP, 2004.
[7] Matthew Aylett and Alice Turk. The Smooth Signal Redundancy Hypothesis: A Functional Explanation for Relationships between Redundancy, Prosodic Prominence, and Duration in Spontaneous Speech. Language and Speech, 2004.
[8] Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 2005.
[9] Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 1948.
[10] Roger Levy and T. Florian Jaeger. Speakers optimize information density through syntactic reduction. Advances in Neural Information Processing Systems (NIPS), 2007.