Simplicity: A cure for overgeneralizations in language acquisition?

Luca Onnis (l.onnis@warwick.ac.uk)
Department of Psychology, University of Warwick
CV4 7AL Coventry, UK

Matthew Roberts (m.roberts.2@warwick.ac.uk)
Department of Psychology, University of Warwick
CV4 7AL Coventry, UK

Nick Chater (nick.chater@warwick.ac.uk)
Department of Psychology and Institute for Applied Cognitive Science, University of Warwick
CV4 7AL Coventry, UK

Abstract

A formal model of learning as induction, the simplicity principle (e.g. Chater & Vitanyi, 2001), states that the cognitive system seeks the hypothesis that provides the briefest representation of the available data, here the linguistic input to the child. Data gathered from the CHILDES database were used as an approximation of the positive input the child receives from adults. We considered linguistic structures that would yield overgeneralization, according to Baker’s paradox (Baker, 1979). A simplicity-based simulation was run incorporating two different hypotheses about the grammar: (1) The child assumes that there are no exceptions to the grammar. This hypothesis leads to overgeneralization. (2) The child assumes that some constructions are not allowed. For small corpora of data, the first hypothesis produced a simpler representation. However, for larger corpora, the second hypothesis was preferred, as it led to a shorter input description and eliminated overgeneralization.

Introduction

Overgeneralizations are a common feature of language development. In learning the English past tense, children typically overgeneralize the ‘-ed’ rule, producing constructions such as ‘we holded the baby rabbits’ (Pinker, 1995). Language learners recover from these errors in spite of the lack of negative evidence and the infinity of allowable constructions that remain unheard. It has been argued that this favours the existence of a specific language-learning device (e.g. Chomsky, 1980; Pinker, 1989).
This is an aspect of the ‘Poverty of the Stimulus’ argument.

We report on a statistical model of language acquisition, which suggests that recovery from overgeneralizations may proceed from positive evidence alone. Specifically, we show that adult linguistic competence in quasi-regular structures may stem from an interaction between a general cognitive principle, simplicity (Chater, 1996), and statistical properties of the input.

According to Baker’s Paradox (Baker, 1979), children are exposed to linguistic structures that they subsequently overgeneralize, demonstrating that they capture some general structure of the language. However, some generalizations are grammatically incorrect, and children do not receive direct negative evidence from caretakers (e.g. corrections labeling such overgeneralizations as disallowed). The paradox is that non-occurrence is not in itself evidence for the incorrectness of a construction, because an infinite number of unheard sentences are still correct.

The irregularities that Baker referred to can be broadly labeled alternations (Levin, 1993; see also Culicover, 2000). For instance, the dative alternation in English allows a class of verbs to take both the double-object construction (He gave Mark the book) and the prepositional construction (He gave the book to Mark). Hence the verb give alternates between two constructions. However, certain verbs seem to be constrained to one possible construction only (He donated the book to Mark is allowed, whereas *He donated Mark the book is not). Such verbs are non-alternating. From empirical studies we know that children do make overgeneralization errors that involve alternations, such as *I said her no (by analogy to I told her no; Bowerman, 1996; Lord, 1979).

In this paper we first present alternation phenomena from the CHILDES database (MacWhinney, 2000) of child-directed speech, which will be used in the computer model.
Secondly, we introduce the simplicity principle (Chater, 1996), based on the mathematical theory of Kolmogorov complexity (Kolmogorov, 1965). Thirdly, we present an artificial language designed to model the CHILDES data, and describe simplicity-based models of language processing and the simulations of recovery from overgeneralizations. Lastly, we discuss the limitations of this specific model and some implications for research on language acquisition.
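
The trade-off described in the abstract, where a grammar without exceptions is cheaper for small corpora and a grammar with exceptions is cheaper for larger ones, can be illustrated with a toy two-part code-length comparison. The sketch below is not the simulation reported in this paper. The coding scheme, the four verbs, and every function name are assumptions made purely for illustration: each free choice between two constructions is charged 1 bit, and each listed exception is charged the bits needed to name the verb plus 1 bit for its single permitted construction.

```python
import math

# Hypothetical toy data: verbs and constructions chosen for illustration only.
VERBS = ["give", "tell", "donate", "say"]
# Assumed exception list for H2: verb -> its only permitted construction.
NON_ALTERNATING = {"donate": "prepositional", "say": "prepositional"}


def make_corpus(rounds):
    """Return `rounds` repetitions of four grammatical (verb, construction) pairs."""
    one_round = [
        ("give", "double_object"),
        ("tell", "prepositional"),
        ("donate", "prepositional"),
        ("say", "prepositional"),
    ]
    return one_round * rounds


def bits_no_exceptions(corpus):
    """H1: every verb may take either construction.

    The grammar costs nothing extra, but each observed construction
    choice costs log2(2) = 1 bit.
    """
    return len(corpus) * math.log2(2)


def bits_with_exceptions(corpus, exceptions, n_verbs):
    """H2: the listed verbs allow one construction only.

    Grammar cost: log2(n_verbs) bits to name each exceptional verb plus
    1 bit for its permitted construction. Data cost: 1 bit per observation
    of an alternating verb, 0 bits for a non-alternating verb.
    """
    grammar_bits = len(exceptions) * (math.log2(n_verbs) + 1)
    data_bits = 0.0
    for verb, construction in corpus:
        if verb in exceptions:
            if exceptions[verb] != construction:
                # Under H2 a disallowed construction cannot be encoded at all.
                raise ValueError(f"{verb!r} observed in a disallowed construction")
        else:
            data_bits += 1.0
    return grammar_bits + data_bits


def preferred(corpus):
    """Return the hypothesis with the shorter total code length (ties go to H1)."""
    h1 = bits_no_exceptions(corpus)
    h2 = bits_with_exceptions(corpus, NON_ALTERNATING, len(VERBS))
    return "H1" if h1 <= h2 else "H2"


if __name__ == "__main__":
    for rounds in (1, 2, 3, 4, 10):
        corpus = make_corpus(rounds)
        print(rounds,
              bits_no_exceptions(corpus),
              bits_with_exceptions(corpus, NON_ALTERNATING, len(VERBS)),
              preferred(corpus))
```

Under these assumed costs, listing the two exceptions costs 6 bits up front and then saves 2 bits per round of four utterances, so the two hypotheses tie at three rounds and the exception-listing grammar becomes shorter from the fourth round onward. The exact crossover point depends entirely on the assumed coding scheme; only the qualitative pattern is meant to mirror the result summarized in the abstract.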