Early acquisition of syntactic categories: A formal model

We propose an explicit, incremental strategy by which children could group words with similar syntactic privileges into discrete, unlabeled categories. This strategy, which can discover lexical ambiguity, is based in part on a generalization of the idea of sentential minimal pairs. As a result, it makes minimal assumptions about the availability of syntactic knowledge at the onset of categorization. Although the proposed strategy is distributional, it can make use of categorization cues from other domains, including semantics and phonology. Computer simulations show that this strategy is effective at categorizing words in both artificial-language samples and transcripts of naturally-occurring, child-directed speech. Further, the simulations show that the proposed strategy performs even better when supplied with semantic information about concrete nouns. Implications for theories of categorization are discussed.

The Role of Distributional Analysis in Grammatical Category Acquisition

As part of acquiring a language, children must learn the grammatical categories of individual words. This is difficult because the same phonological word can be assigned to different categories across languages, and even within one language. For instance, the word /si/ is a verb (see) or noun (sea) in English, and a conjunction (si, ‘if’) or adverb (si, ‘so’) in French. In this paper, we propose a novel theory consisting of a strategy by which young children could exploit distributional information to categorize words, and we present a series of computer simulations demonstrating that this strategy is effective. The theory is motivated by formal principles of statistical inference and stated formally, but, as we show, its qualitative properties are clear and easy to understand.
Previous research has focused on discovering sources of information that children could exploit to categorize words, on weighing the relative importance of each source, and on developing learning processes that exploit these information sources. The primary information sources that have been explored are distributional regularity (e.g., Maratsos & Chalkley, 1980), syntactic knowledge (e.g., Pinker, 1984), semantics (e.g., Grimshaw, 1981), and phonology (e.g., Kelly, 1992). Distributional regularity may contribute to categorization because the ordering of categories in sentences is restricted by the language. Children could exploit these distributional regularities by observing the restricted environments in which words occur and basing grammatical categories on sets of words that share identical, or even merely similar, privileges (e.g., Maratsos & Chalkley, 1980). Children may use the syntactic structure of sentences to confine their analyses to smaller, more appropriate domains (Pinker, 1984, 1987). As for semantics, many researchers have observed that certain semantic features are regularly and almost universally correlated with grammatical categories (Bates & MacWhinney, 1982; Schlesinger, 1988); for instance, words referring to concrete objects are almost always nouns. Although semantic categories do not perfectly align with grammatical categories, they may provide a foundation that can be generalized by subsequent, purely distributional analyses (Grimshaw, 1981; Macnamara, 1982; Pinker, 1984, 1987). Finally, phonological correlates of grammatical category exist in many languages (e.g., Kelly, 1992). The theory proposed in this paper is based on the use of distributional information.
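The notion of shared distributional privileges can be made concrete with a small sketch. This is our illustration over a made-up toy corpus, not the simulation program described later: record each word's environments (here, a minimal one-word-left, one-word-right frame) and ask whether two words ever share an environment.

```python
from collections import defaultdict

# Toy child-directed utterances (hypothetical data, for illustration only).
utterances = [
    "the cat sleeps",
    "the dog sleeps",
    "the cat eats",
    "the dog eats",
    "a toy falls",
]

# A word's environment here is its (previous word, next word) frame,
# a minimal stand-in for Harris's "sum of all its environments".
contexts = defaultdict(set)
for utt in utterances:
    words = ["<s>"] + utt.split() + ["</s>"]
    for i in range(1, len(words) - 1):
        contexts[words[i]].add((words[i - 1], words[i + 1]))

# Words that share even one environment exhibit a shared privilege
# and are candidates for the same (unlabeled) category.
def share_frame(w1, w2):
    return bool(contexts[w1] & contexts[w2])

print(share_frame("cat", "dog"))     # → True (both occur in ("the", "sleeps"))
print(share_frame("cat", "sleeps"))  # → False (no common frame)
```

In a realistic corpus, requiring an exact shared frame is far too strict, which is why the strategy proposed in the paper generalizes environments rather than demanding identical ones.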
This theory improves upon previous distribution-based proposals because (a) it makes few assumptions about the availability of syntactic knowledge, yet is compatible with modern theories of syntax acquisition; (b) it assumes sentences are processed one at a time and are forgotten after processing; (c) it results in a discrete categorization of input tokens; (d) it allows word types to be put in more than one category; (e) it can exploit other sources of information pertaining to categorization, such as semantics; and (f) it combines all these properties in a detailed, explicit learning strategy. In the remainder of this introduction, we review evidence for the importance of both distributional and semantic information in categorization, then describe how the proposed strategy exploits distributional information in a novel way. Experiments 1 and 2 demonstrate that computer simulations of the strategy are very successful at learning the categories implicit in samples generated from artificial languages that are defined exclusively in distributional terms. Experiments 3 and 4 show that the same simulation program can learn grammatical categories from transcriptions of naturally-occurring, child-directed speech. Finally, in Experiment 5, we present one way in which semantic information could be exploited within our theoretical framework, and demonstrate that the simulation program of Experiments 1–4 benefits from the addition of this semantic information. In the General Discussion, we relate the quantitative evidence of the experiments to qualitative properties of the simulations, discuss the theoretical implications of the results, and suggest directions for future work.
Theories of Category Acquisition

The first serious attempt to develop ideas about grammatical categories and how they can be discovered came from the structuralist linguists; Harris (1951, 1954) led this effort, introducing the terms distribution and distributional analysis. The distribution of a word was defined as “the sum of all its environments” (Harris, 1954, p. 146); a word’s environment, in turn, was defined as its position relative to other words in all utterances in which it occurred. To simplify the description of environments, classes of words could stand in place of individual words. Thus, utterances were thought to be formed “by choosing members of those classes that regularly occur together and in the order in which these classes occur” (Harris, 1954, p. 146). This last idea, including its probabilistic nature, is directly reflected in the learning strategy proposed in this paper; we call the sequence of classes describing an utterance its template. Harris intended his work to help other linguists discover and describe the distributional patterns of a language, and did not attempt a full analysis of English; Fries (1952), however, did attempt a full analysis. He constructed simple utterance templates to identify word categories. For instance, any single word that could grammatically complete the template “The ____ is good” was part of his Class A (i.e., was a noun). This template is not sufficient to find all nouns (e.g., cats), so the template was generalized to “(The) ____ is/was/are/were good” (meaning The was optional and several choices of verb were permitted). By creating and generalizing a set of similar templates, Fries identified a total of 19 word classes. Harris, Fries, and other structuralists sharpened intuitions about the importance of distribution for categorization, but they never described a fully explicit, practical process by which distribution could be exploited.
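The template idea and the role of sentential minimal pairs can be illustrated with a toy sketch. This is our construction for exposition, not the authors' model: two utterances that differ in exactly one position place the differing words in one unlabeled class, and rewriting utterances over those classes yields templates in Harris's sense.

```python
from itertools import combinations

# Toy utterances as word tuples (hypothetical data, for illustration only).
utterances = [
    ("the", "cat", "sleeps"),
    ("the", "dog", "sleeps"),
    ("the", "dog", "eats"),
]

# Union-find over words: two equal-length utterances differing in exactly
# one position form a sentential minimal pair, so the differing words are
# merged into the same (unlabeled) class.
parent = {}

def find(w):
    parent.setdefault(w, w)
    while parent[w] != w:
        parent[w] = parent[parent[w]]  # path compression
        w = parent[w]
    return w

def union(a, b):
    parent[find(a)] = find(b)

for u1, u2 in combinations(utterances, 2):
    if len(u1) == len(u2):
        diffs = [(a, b) for a, b in zip(u1, u2) if a != b]
        if len(diffs) == 1:
            union(*diffs[0])

# Rewriting each utterance with class representatives yields its template.
def template(utt):
    return tuple(find(w) for w in utt)

# The first and third utterances never form a minimal pair directly, yet
# they end up sharing a template via the chain of pairwise merges.
print(template(("the", "cat", "sleeps")) == template(("the", "dog", "eats")))  # → True
```

The example also shows why generalization matters: "the cat sleeps" and "the dog eats" differ in two positions, so no single minimal pair relates them, but templates built from earlier merges do.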
The structuralists' goal was to tell other linguists what to look for and to give examples of analyses. Nevertheless, they contributed three important ideas. First, lexical categories must be defined in terms of their distribution. Second, distribution is defined with respect to some structured environment. Although linguists have refined the structural descriptions of sentences, it is still true that a categorization strategy based on distributional analysis depends on the analysis of structured environments (e.g., Elman, 1990; Pinker, 1984). In this paper, we show that analyzing even minimally structured environments is useful for early category acquisition. Finally, the structuralists realized that generalizing environments from sequences of words to sequences of classes (templates) yields more compact and understandable descriptions of sentences. We use templates in our learning strategy for an additional reason: generalization is necessary because no language learner ever hears enough minimal pairs of sentences to learn a complete categorization of words. Since these ideas were first introduced, many linguists, psycholinguists, and cognitive psychologists have proposed theories of grammatical category acquisition. Some have examined the sources of information that children could exploit in learning the categories of words, building theories around the use of distribution, syntax, semantics, or phonology; others have tried to simulate category acquisition with computer models. In the next two subsections, we briefly review the development of theories of category acquisition.

Cognitive Theories

Semantic cues. In grade school we were taught that a noun is a person, place, thing, or idea, that a verb is an action, and so on; that is, grammatical categories were defined semantically. If this were true, category acquisition would be simple: learn the meanings of words, and their categories would follow immediately.
But as the structuralists argued, grammatical categories are just that—grammatical—and must ultimately have structural definitions based on their use in a grammar (see also Maratsos, 1988; Pinker, 1984). However, the correlation between grammatical categories and semantic classes is both strong and universal (Bates & MacWhinney, 1982; Schlesinger, 1988). Perhaps children use this correlation to begin learning syntactic categories by grouping together words that refer to the same semantic category. For example, children could infer that since cat, dog, and toy refer to concrete objects, they all belong together in the same (yet unnamed) category. According to this story,

[1] Feldman, J. A., et al. (1972). Some Decidability Results on Grammatical Inference and Complexity. Information and Control.

[2] Brent, M. R., et al. (1994). Surface cues and robust inference as a basis for the early acquisition of subcategorization frames.

[3] Brill, E., et al. (1990). Deducing Linguistic Structure from the Statistics of Large Corpora. HLT.

[4] Kelly, M. H., et al. (1988). Stress in time. Journal of Experimental Psychology: Human Perception and Performance.

[5] Elman, J. (1993). Learning and development in neural networks: the importance of starting small. Cognition.

[6] Siklóssy, L. (1971). A language-learning heuristic program.

[7] MacWhinney, B., et al. (1990). The Child Language Data Exchange System: an update. Journal of Child Language.

[8] Rivest, R. L., et al. (1989). Inferring Decision Trees Using the Minimum Description Length Principle. Information and Computation.

[9] Brent, M. (1996). Advances in the computational study of language acquisition. Cognition.

[10] Gleitman, L. (2020). The Structural Sources of Verb Meanings. In Sentence First, Arguments Afterward.

[11] MacWhinney, B. (1992). The CHILDES Project: Tools for Analyzing Talk.

[12] Ratner, N. (1984). Patterns of vowel modification in mother–child speech. Journal of Child Language.

[13] Jusczyk, P., et al. (1995). Infants' Detection of the Sound Patterns of Words in Fluent Speech. Cognitive Psychology.

[14] Kelly, M. H., et al. (1992). Using sound to solve syntactic problems: the role of phonology in grammatical category assignments. Psychological Review.

[15] Jusczyk, P., et al. (1988). A precursor of language acquisition in young infants. Cognition.

[16] Schlesinger, I. M., et al. (1990). Categories and Processes in Language Acquisition.

[17] Stolcke, A., et al. (1994). Inducing Probabilistic Grammars by Bayesian Model Merging. ICGI.

[18] Cooper, W., et al. (1978). Speech timing of grammatical categories. Cognition.

[19] Maratsos, M., et al. (1980). The internal language of children's syntax: The ontogenesis and representation of syntactic categories.

[20] Wanner, E., et al. (1982). Language Acquisition: The State of the Art.

[21] Fries, C. C. (1952). The Structure of English: An Introduction to the Construction of English Sentences.

[22] Brent, M. R. (1993). From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. Computational Linguistics.

[23] Harris, Z. S. (1951). Methods in Structural Linguistics.

[24] Elman, J. L. (1990). Finding Structure in Time. Cognitive Science.

[25] Ross, H. (1964). Principles of Numerical Taxonomy.

[26] Hamming, R. W. Coding and Information Theory.

[27] Gleitman, L. R. (1992). A Picture Is Worth a Thousand Words, but That's the Problem: The Role of Syntax in Vocabulary Acquisition.

[28] Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science.

[29] Haegeman, L. (1991). Introduction to Government and Binding Theory.

[30] Newport, E. L., et al. (1981). The role of constituent structure in the induction of an artificial language.

[31] Cartwright, T. A., et al. (1996). Distributional regularity and phonotactics are useful for early lexical acquisition.

[32] Pinker, S. (1979). Formal models of language learning. Cognition.

[33] Li, M., et al. (2019). An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science.

[34] Brill, E., et al. (1991). Discovering the Lexical Features of a Language. ACL.

[35] Baker, C. L., et al. (1984). The Logical Problem of Language Acquisition.

[36] Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE.

[37] Kelly, M. H., et al. (1991). Phonological information for grammatical category assignments.

[38] Kiss, G. R. (1973). Grammatical Word Classes: A Learning Process and its Simulation.

[39] Pinker, S. (1984). Language Learnability and Language Development.