Nonparametric Bayesian models of lexical acquisition

The child learning language is faced with a daunting task: to learn to extract meaning from an apparently meaningless stream of sound. This thesis rests on the assumption that the kinds of generalizations the learner may make are constrained by the interaction of many different types of stochastic information, including innate learning biases. I use computational modeling to investigate how the generalizations made by unsupervised learners are affected by the sources of information available to them. I adopt a Bayesian perspective, where both internal representations of language and any learning biases are made explicit. I begin by presenting a generic framework for language modeling based on nonparametric Bayesian statistics, where model complexity grows with the amount of input data. This framework divides the work of modeling between a generator, which generates lexical items, and an adaptor, which generates frequencies for those items. Separating the two tasks in this way makes the framework flexible, allowing individual components to be easily modified. Standard sampling methods, such as Gibbs or Metropolis-Hastings sampling, may be used for inference. Using this framework, I develop several specific models to investigate questions related to morphological acquisition (identifying stems and suffixes) and word segmentation (identifying word boundaries in phonemically transcribed speech). I apply these models to English corpora of newspaper text and phonemically transcribed child-directed speech. With regard to morphology, my experiments provide evidence that morphological information is learned better from word types than from word tokens. With regard to word segmentation, my results indicate that assuming independence between words (as many previous models have done) leads to undersegmentation of the data. Accounting for local context improves segmentation markedly and yields better results than previous models. 
I conclude by describing briefly how the models presented here can be extended in order to account for a wider range of linguistic phenomena, including phonetic variability and the relationship between morphology and syntactic class.
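
The division of labor between generator and adaptor described above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the generator is a toy model that draws random strings with geometrically distributed length, the adaptor is a Chinese restaurant process with concentration parameter alpha, and the names make_generator and crp_adaptor are hypothetical. The sketch only simulates data forward from the two components; it does not perform inference.

```python
import random


def make_generator(alphabet="abc", stop_prob=0.5, rng=None):
    """Hypothetical generator: returns a function that draws one lexical item.

    Each item is a random string over `alphabet`; after every character the
    string ends with probability `stop_prob` (geometric length, minimum 1).
    """
    rng = rng or random.Random()

    def generate():
        chars = [rng.choice(alphabet)]
        while rng.random() > stop_prob:
            chars.append(rng.choice(alphabet))
        return "".join(chars)

    return generate


def crp_adaptor(generate, n_tokens, alpha=1.0, rng=None):
    """Hypothetical adaptor: a Chinese restaurant process over generated items.

    Token i (counting from 0) opens a new table with probability
    alpha / (i + alpha) and labels it with a fresh draw from the generator.
    Otherwise it joins an existing table with probability proportional to
    that table's current count and reuses the table's label.
    """
    rng = rng or random.Random()
    labels = []  # lexical item attached to each table
    counts = []  # number of tokens seated at each table
    tokens = []
    for i in range(n_tokens):
        if rng.random() < alpha / (i + alpha):
            # New table: the generator decides WHAT the item is.
            labels.append(generate())
            counts.append(1)
            tokens.append(labels[-1])
        else:
            # Existing table: the adaptor decides HOW OFTEN an item recurs.
            k = rng.choices(range(len(counts)), weights=counts)[0]
            counts[k] += 1
            tokens.append(labels[k])
    return tokens, labels, counts


if __name__ == "__main__":
    rng = random.Random(0)
    gen = make_generator(rng=rng)
    tokens, labels, counts = crp_adaptor(gen, n_tokens=200, alpha=1.0, rng=rng)
    print(len(labels), "tables for", len(tokens), "tokens")
    print(sorted(counts, reverse=True)[:5])
```

Because the generator is passed in as a plain function, it can be swapped (for example, for one that builds items from a stem and a suffix) without touching the adaptor, which is the kind of modularity the framework is meant to provide. Under a Chinese restaurant process, tables that already hold many tokens are more likely to receive the next one, so a few items tend to account for most of the tokens.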

[157]  Jan Hajic,et al.  Tagging Inflective Languages: Prediction of Morphological Categories for a Rich Structured Tagset , 1998, ACL.