Human language may have evolved through a stage in which words were combined into structured linear segments, before these segments were used as building blocks for a hierarchical grammar. Experiments using information-theoretic metrics show that such a stage could have had its own evolutionary advantage, before the benefits of a full grammar were obtained. This hypothesis is approached by examining the apparently ubiquitous prevalence of homophones. We show how, perhaps contrary to expectation, communicative capacity does not seem to be adversely affected by them, and they are routinely used without confusion. This is principally explained by disambiguation through syntactic processing of short word sequences. It indicates that local sequential processing plays an underlying role in language production and perception, a hypothesis supported by evidence that small children engage in this process as soon as they acquire words. Experiments on a corpus of spoken English calculated the entropy for sequences of syntactically labelled words. They show that there is a measurable advantage in decoding word strings when they are taken in short sequences rather than as individual items. This suggests that grammatical fragments of speech could have been a stepping stone to a full grammar.

Introduction

The mapping of speech sounds onto meaning lies at the core of the human ability to communicate by language, and the limited range of sounds that other creatures can make contrasts markedly with the much wider range and combinatorial use of phonetic elements in human speech. The physiological changes to the vocal tract that were necessary to enable the production of speech sounds have concomitant disadvantages, but the value of the mechanisms exapted or adapted to support language appears to have outweighed these problems [1].
If human language had been designed to a teleological programme, we might have expected an optimum number of phonemes to provide the basis for speech. However, we find that the number of phonemes varies from about 12 to well over 100 [2]. There is massive redundancy. Some phonetic elements that can serve as particularly salient distinguishing features, such as clicks or ejectives, occur in only a subset of human languages. We see here not survival of the fittest, but survival of the many, varied, fit. We might also have expected a one-to-one mapping between sounds and meanings. Indeed, recent mathematical models of how language might have evolved take this approach and show how a limited number of phonemes can be combined to produce an indefinitely large number of unambiguous words [3, 4]. Nowak asserts that “ambiguity ... is the loss of communicative capacity that arises if individual sounds are linked to more than one meaning” [3, p. 613], that the absence of word ambiguity is a mark of evolutionary fitness, and that word formation provides an exponential increase in fitness with word length. However, these models do not reflect language in the real world. Homophony is common in English, as in other languages, though it is certainly not the case that a shortage of phonetic elements leads to a need for the same sounds to carry multiple meanings. Many of the most frequently used words are ambiguous homophones (for example: to, too, two; there, their; I, eye) [5]. In spite of the theoretical possibility of exploiting the combinatorial properties of a set of phonemes, this does not in practice necessarily occur, yet communicative capacity does not seem to be adversely affected. We find homophones in the speech of small children [5], and observe the slippage of language into forms with more homophones [6; 7, p. 5].
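The combinatorial claim above can be made concrete with a small sketch (the figures are illustrative; only the low end of about 12 phonemes comes from the text). With an inventory of P phonemes, the number of distinct strings of length L is P^L, so even a minimal inventory could in principle supply an unambiguous form for every meaning:

```python
# Illustrative sketch, not from the paper: counting distinct phoneme
# strings of a given length for a given phoneme inventory.
def distinct_forms(phonemes: int, length: int) -> int:
    """Number of distinct phoneme strings of the given length."""
    return phonemes ** length

# With 12 phonemes (near the low end cited), the space of possible
# word forms grows exponentially with length.
for L in range(1, 6):
    print(L, distinct_forms(12, L))  # 1 -> 12, ..., 5 -> 248832
```

The exponential growth is the point: real languages leave most of this space unused and tolerate homophones instead.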
Analysis of homophones

We can analyse homophones in two groups: those in which the homophonous forms are the same grammatical parts-of-speech, and those in which they are different. In English, as in other languages [8], the second class is much the larger. Taking the smaller class first, semantic information may be necessary to distinguish these words. They may be distinct concepts spelt differently, such as hair and hare, or distinct concepts spelt the same, such as (river) bank and (money) bank. They may have common ancestry, and have been subject to a gradual semantic shift. For instance, to stamp can have the distinct meanings to stamp a foot or to stamp a letter; linking these two meanings was a stage when letters were sealed with a heavy stamp. Homophonous forms may also be variations on a theme, as in the example from Wittgenstein of the word game [9, sections 66-76]. He points out that there is nothing common to all meanings of the word, but rather “a complicated network of similarities, overlapping and crisscrossing”. This class of homophones with the same parts-of-speech has been the subject of mathematical modelling, for example by Wang et al. [7], where a word refers to “an association between a meaning and an utterance”, and there seems to be an implicit assumption that words are content words. However, the much larger class of homophones that are different parts of speech raises significant issues and deserves further scrutiny. Homophonous forms are frequently function words, and the fact that we can disambiguate them with such facility provides clues to our underlying syntactic abilities. For example, the words to / too / two are used and understood correctly by children very early on. We see that disambiguation must be through contextual processing, and this contextual processing seems to be based mainly on relations with adjacent words (for example me too, two sweets, to the swing).
The subconscious use of grammatical categories can explain how the appropriate lexical item is selected. Without invoking a full grammar, short word sequences, grammatical fragments, can be judged acceptable or not.

Perception and production of syntactically correct phrases

There is an ongoing debate as to how children acquire syntactic knowledge [10, 11, 12], but there is a general consensus that children are aware of syntactic categories from a very young age. Infants are aware of prosodic cues to syntactic elements, and can exploit them in the processing of speech [13, 14]. For instance, in English, children use correct word order as soon as two-word utterances are produced [15]. This helps to explain how young children can understand phrases and sentences containing homophonous terms: local syntactic constraints are employed as soon as words are acquired. Older speakers, as much as infants, are implicitly aware of syntactic categories. The fact that many could not explicitly define these categories does not detract from the proposition: in the same way, we can estimate the distance to a remote object implicitly, using optical rules that we cannot explicitly formulate. If we accept this proposition, then we can see that the disambiguation of homophones will often be based on the admissibility or otherwise of neighbouring parts-of-speech. For instance, consider their / there. “their” is a possessive pronoun, typically followed by a noun or noun phrase. “there” is not usually followed by a noun or noun phrase, but typically by a verb, adverb or preposition:

    Their adventures made a good story.
    Their thrilling exploits amazed us.
    There are many more to come.
    They went there quickly.

    Figure 1

In Figure 1, the alternative forms their / there cannot be confused, because of local syntactic disambiguation. For homophonous function words like these, there is little or no content to aid disambiguation, nor is it necessary.
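The admissibility test just described can be sketched as a lookup over neighbouring parts-of-speech. This is a toy illustration, not the paper's procedure: the tag names and the admissibility table are our assumptions, chosen to match the their / there examples above.

```python
# Toy sketch of local syntactic disambiguation of a homophone pair.
# The admissibility table below is illustrative, not from the paper.

# Parts-of-speech that may immediately follow each written form.
ADMISSIBLE_NEXT = {
    "their": {"NOUN", "ADJ"},          # their adventures, their thrilling exploits
    "there": {"VERB", "ADV", "PREP"},  # there are ..., went there quickly
}

def disambiguate(candidates, next_tag):
    """Return the candidate forms whose local syntax admits next_tag."""
    return [w for w in candidates if next_tag in ADMISSIBLE_NEXT[w]]

print(disambiguate(["their", "there"], "NOUN"))  # ['their']
print(disambiguate(["their", "there"], "VERB"))  # ['there']
```

A single neighbouring tag suffices here; no semantic content of the homophone itself is consulted, which is the point made in the text.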
Experiments on the efficient decoding of word strings

The observations made so far suggest that processing short, syntactically labelled word sequences could play an underlying role in speech production and perception. To test this hypothesis, we carried out experiments to see whether there was an advantage in processing words as short strings rather than as individual items. Using information-theoretic tools, we have investigated the efficiency of decoding word sequences segmented in different ways. The concept on which these experiments are based is that we can measure the entropy of a sequence; a decline in entropy is associated with an increase in predictability, and hence with an improvement in the efficiency of decoding and in comprehensibility [16]. For a simple introduction to this concept see [17]; a standard reference is [18]. Taking the proposition that we are implicitly aware of syntactic categories, or part-of-speech tags, we investigate whether tag strings are more easily decoded if they are taken in short sequences rather than as single items. In the rest of this paper we take the term “tag” to mean “part-of-speech tag”. If we find that entropy declines as we take tags in pairs and triples, this would indicate that the processing of short sequences is likely to have developed alongside improved understanding of speech. In turn, this would help explain how homophonous words are routinely used without confusion: they are disambiguated by being taken in conjunction with neighbouring words.

For our experiments we use the Machine Readable Spoken English Corpus (MARSEC), organized by Arnfield [19, 20]; about 26,000 words are used. MARSEC includes prosodic annotation, which we do not use in the current experiments. The corpus includes unscripted news commentary, scripted news and lectures. This can be considered “well-formed” language, unlike informal conversation. Experiments are planned on other types of spoken language.
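The link between entropy and predictability can be seen in a minimal sketch (the distributions are invented for illustration): a uniform distribution over outcomes is maximally unpredictable, while a skewed one is easier to decode.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equiprobable tags: maximal uncertainty.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
# A skewed distribution over the same four tags: lower entropy,
# hence more predictable and more efficiently decoded.
print(entropy([0.7, 0.1, 0.1, 0.1]))      # about 1.36 bits
```

Lower entropy means fewer bits of uncertainty per symbol, which is the sense in which a decline in entropy marks an improvement in decoding efficiency.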
The first step in the experiment is to map words onto part-of-speech tags. This was done using a version of the CLAWS tagger (supplied by the University of Lancaster) described by Garside [21]. The CLAWS tagset was mapped onto a smaller customised tagset consisting of 26 part-of-speech tags (Appendix A). The next stage is to measure the entropy in four cases: with no statistical information, then with information on single tags, tag pairs and tag triples. Taking the symbol H as entropy, with no statistical information H = log2 N for N equiprobable tags, while for tags with measured probabilities p(i), H = −Σ p(i) log2 p(i).
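The four measurements can be sketched as follows. This is our illustrative reconstruction, not the paper's actual code: the toy tag sequence is invented, and normalising pair and triple entropies to bits per tag (dividing the n-gram block entropy by n) is an assumption about the method. On such a tiny sample the decline with n is partly a small-sample artefact; the paper's corpus of about 26,000 words is far larger.

```python
import math
from collections import Counter

def block_entropy(tags, n):
    """Per-tag entropy (bits) estimated from n-gram frequencies."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    H = -sum(c / total * math.log2(c / total) for c in counts.values())
    return H / n  # normalise block entropy to bits per tag

# Invented toy sequence of part-of-speech tags.
tags = ["PRON", "VERB", "DET", "NOUN", "PREP", "DET", "NOUN",
        "PRON", "VERB", "ADV", "PREP", "DET", "ADJ", "NOUN"]

# Case 1: no statistics, 26 equiprobable tags.
H0 = math.log2(26)
print(round(H0, 2))  # 4.7
# Cases 2-4: single tags, pairs, triples.
for n in (1, 2, 3):
    print(n, round(block_entropy(tags, n), 2))
```

On this toy data the per-tag entropy falls as longer sequences are taken, which is the shape of result the experiments look for.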
References

[1] P. Lieberman. On the nature and evolution of the neural bases of human language. American Journal of Physical Anthropology, 2002.
[2] Ian Maddieson et al. Patterns of Sounds, 1986.
[3] P. Warren et al. Goldilocks and the Three Beers: Word Recognition and Sound Merger, 2002.
[4] Thomas M. Cover et al. Elements of Information Theory, 2005.
[5] Sandra Warren. Phonological acquisition and ambient language: a corpus-based cross-linguistic exploration, 2001.
[6] James W. Minett et al. Computational Studies of Language Evolution, 2003.
[7] Cynthia Fisher et al. The role of abstract syntactic knowledge in language acquisition: a reply to Tomasello (2000). Cognition, 2002.
[8] Chrystopher L. Nehaniv et al. The Segmentation of Speech and its Implications for the Emergence of Language Structure, 2001.
[9] Claude E. Shannon et al. Prediction and Entropy of Printed English, 1951.
[10] P. Niyogi et al. Computational and evolutionary aspects of language. Nature, 2002.
[11] Sandra R. Waxman et al. Words as Invitations to Form Categories: Evidence from 12- to 13-Month-Old Infants. Cognitive Psychology, 1995.