Comparing Models of Phonotactics for Word Segmentation

Developmental research indicates that infants use low-level statistical regularities, or phonotactics, to segment words from continuous speech. In this paper, we present a segmentation framework that enables the direct comparison of different phonotactic models for segmentation. We compare a model using phoneme transitional probabilities, which have been widely used in computational models, to syllable-based bigram models, which have played a prominent role in the developmental literature. We also introduce a novel estimation method and compare it to other strategies for estimating the parameters of the phonotactic models from unsegmented data. The results show that syllable-based models outperform the phoneme models, specifically when their parameters are estimated with the improved unsupervised method. The syllable-based transitional probability model achieves a word token F-score of nearly 80%, the highest reported performance for a phonotactic segmentation model with no lexicon.
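To make the two quantities named above concrete, the following is a minimal sketch, not the paper's implementation. It assumes forward transitional probabilities estimated by simple relative frequency over sequences of units (phonemes or syllables), and a word token F-score that counts a predicted word as correct only when both of its boundaries match the gold segmentation. The function names `transitional_probs`, `word_spans`, and `token_f_score` are hypothetical, and the paper's own estimation method is not reproduced here.

```python
from collections import Counter


def transitional_probs(utterances):
    """Estimate forward transitional probabilities P(b | a).

    `utterances` is a list of unit sequences (phonemes or syllables).
    Assumption: plain relative-frequency estimation, count(a b) / count(a),
    where count(a) only counts occurrences of `a` that have a successor.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for units in utterances:
        for a, b in zip(units, units[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def word_spans(words):
    """Convert a segmented utterance into a set of (start, end) offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def token_f_score(gold, predicted):
    """Word token F-score for one utterance.

    A predicted token is correct only if both its start and end
    offsets coincide with those of a gold token.
    """
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy syllable sequences (illustrative data, not from the paper).
tps = transitional_probs([["pre", "ty", "ba", "by"], ["pre", "ty", "dog"]])
score = token_f_score(["the", "dog", "ran"], ["the", "dogran"])
```

In the toy data, within-word transitions such as `pre` to `ty` receive a higher probability than transitions that span a word boundary, such as `ty` to `ba`, which is the kind of contrast a transitional-probability segmenter relies on.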
