Speech segmentation with a neural encoder model of working memory

We present the first unsupervised LSTM speech segmenter as a cognitive model of the acquisition of words from unsegmented input. Cognitive biases toward phonological and syntactic predictability in speech are rooted in the limitations of human memory (Baddeley et al., 1998): compressed representations are easier to acquire and retain in memory. To model the biases these memory limitations introduce, our system uses an LSTM-based encoder-decoder with a small number of hidden units, then searches for the segmentation that minimizes autoencoding loss. Linguistically meaningful segments (e.g., words) should share regular patterns of features that aid decoder performance relative to random segmentations, and we show that our learner discovers these patterns when trained on either phoneme sequences or raw acoustics. To our knowledge, ours is the first fully unsupervised system able to segment both symbolic and acoustic representations of speech.
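
To make the objective concrete, here is a minimal sketch (ours, not the paper's code, and assuming PyTorch) of how such a segmenter can score candidate segmentations: each proposed segment is compressed through a small LSTM autoencoder, and a segmentation's score is the sum of per-segment reconstruction losses. The narrow hidden state plays the role of the working-memory bottleneck. The names (SegmentAutoencoder, segmentation_loss, best_segmentation) are illustrative, no training loop is shown, and the exhaustive search over break points stands in for whatever more scalable search a real system would use.

```python
# Illustrative sketch only: score candidate segmentations of an utterance
# by how well a small LSTM autoencoder reconstructs each proposed segment.
# All names here are hypothetical, not taken from the paper's codebase.

import itertools
import torch
import torch.nn as nn

class SegmentAutoencoder(nn.Module):
    """Encode a segment (a sequence of phone embeddings or acoustic
    frames) into a small fixed vector, then decode it back. The small
    hidden size is the 'working memory' bottleneck."""

    def __init__(self, feat_dim: int, hidden_dim: int = 16):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (1, T, feat_dim)
        _, state = self.encoder(segment)       # compress to a fixed-size state
        zeros = torch.zeros_like(segment)      # decode from the state alone
        hidden, _ = self.decoder(zeros, state)
        return self.out(hidden)                # reconstruction: (1, T, feat_dim)

def segmentation_loss(model, utterance, breaks):
    """Sum of per-segment reconstruction (MSE) losses for one segmentation.
    E.g. breaks=(3, 5) splits a 7-frame utterance into [0:3], [3:5], [5:7]."""
    mse = nn.MSELoss(reduction="sum")
    bounds = [0, *breaks, utterance.shape[0]]
    total = torch.tensor(0.0)
    for lo, hi in zip(bounds, bounds[1:]):
        seg = utterance[lo:hi].unsqueeze(0)    # (1, T_seg, feat_dim)
        total = total + mse(model(seg), seg)
    return total

def best_segmentation(model, utterance):
    """Exhaustive search over all 2^(T-1) segmentations of a short
    utterance; feasible only for toy inputs."""
    T = utterance.shape[0]
    candidates = itertools.chain.from_iterable(
        itertools.combinations(range(1, T), k) for k in range(T))
    return min(candidates,
               key=lambda b: segmentation_loss(model, utterance, b).item())

if __name__ == "__main__":
    torch.manual_seed(0)
    model = SegmentAutoencoder(feat_dim=8, hidden_dim=4)
    utterance = torch.randn(6, 8)              # 6 frames of 8-dim features
    print(best_segmentation(model, utterance)) # tuple of chosen break points
```

In practice one would alternate between training the autoencoder on the currently hypothesized segments and re-segmenting under the updated model. The bottleneck size is the key design choice: a sufficiently large hidden state could memorize arbitrary spans, and the loss comparison would no longer favor reusable, word-like units over random ones.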

[1] D. Swingley, et al. At 6–9 months, human infants know the meanings of many common nouns, 2012, Proceedings of the National Academy of Sciences.

[2] Anne Cutler, et al. Recognition and Representation of Function Words in English-Learning Infants, 2006.

[3] Margaret M. Fleck. Lexicalized Phonotactic Word Segmentation, 2008, ACL.

[4] Thomas L. Griffiths, et al. Learning phonetic categories by learning a lexicon, 2009.

[5] Samy Bengio, et al. Show and tell: A neural image caption generator, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] P. Jusczyk, et al. Infants' Detection of the Sound Patterns of Words in Fluent Speech, 1995, Cognitive Psychology.

[7] James R. Glass. A probabilistic framework for segment-based speech recognition, 2003, Computer Speech & Language.

[8] Aren Jansen, et al. Unsupervised neural network based feature extraction using weak top-down constraints, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] James R. Glass, et al. Unsupervised Lexicon Discovery from Acoustic Input, 2015, TACL.

[10] Ryan Cotterell, et al. Modeling Word Forms Using Latent Underlying Morphs and Phonology, 2015, TACL.

[11] Mark Johnson, et al. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars, 2009, NAACL.

[12] Mark Johnson, et al. Why Doesn't EM Find Good HMM POS-Taggers?, 2007, EMNLP.

[13] Michael C. Frank, et al. Unsupervised word discovery from speech using automatic segmentation into syllable-like units, 2015, INTERSPEECH.

[14] P. Hallé, et al. The role of accentual pattern in early lexical representation, 2004.

[15] A. Baddeley, et al. The phonological loop as a language learning device, 1998, Psychological Review.

[16] Mark Johnson, et al. Exploring the Role of Stress in Bayesian Word Segmentation using Adaptor Grammars, 2014, TACL.

[17] Constantine Lignos, et al. Modeling Infant Word Segmentation, 2011, CoNLL.

[18] Jeffrey Pennington, et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions, 2011, EMNLP.

[19] Jeffrey L. Elman, et al. Finding Structure in Time, 1990, Cognitive Science.

[20] James R. Glass, et al. A Nonparametric Bayesian Approach to Acoustic Model Discovery, 2012, ACL.

[21] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.

[22] Alex Graves, et al. Recurrent Models of Visual Attention, 2014, NIPS.

[23] Steven T. Piantadosi, et al. The communicative function of ambiguity in language, 2011, Cognition.

[24] C. Anton Rytting, et al. Segmenting words from natural speech: subsegmental variation in segmental cues, 2010, Journal of Child Language.

[25] Yoshua Bengio, et al. Hierarchical Multiscale Recurrent Neural Networks, 2016, ICLR.

[26] Robert Daland, et al. Learning Diphone-Based Segmentation, 2011, Cognitive Science.

[27] Aren Jansen, et al. An evaluation of graph clustering methods for unsupervised term discovery, 2015, INTERSPEECH.

[28] T. Griffiths, et al. A Bayesian framework for word segmentation: Exploring the effects of context, 2009, Cognition.

[29] James L. McClelland, et al. Unsupervised learning of vowel categories from infant-directed speech, 2007, Proceedings of the National Academy of Sciences.

[30] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[31] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, Journal of Machine Learning Research.

[32] Yee Whye Teh, et al. Beam sampling for the infinite hidden Markov model, 2008, ICML.

[33] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34] Michael R. Brent, et al. An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery, 1999, Machine Learning.

[35] Daniel Swingley, et al. Statistical clustering and the contents of the infant vocabulary, 2005, Cognitive Psychology.

[36] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[37] A. Baddeley. Working Memory, Thought, and Action, 2007.

[38] Naonori Ueda, et al. Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling, 2009, ACL.

[39] Bob McMurray, et al. Learning During Processing: Word Learning Doesn't Wait for Word Recognition to Finish, 2017, Cognitive Science.

[40] J. Morgan, et al. Mommy and Me, 2005, Psychological Science.

[41] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[42] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[43] Aren Jansen, et al. The zero resource speech challenge 2017, 2017, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[44] P. Mermelstein, et al. Segmentation of speech into syllabic units, 1974.

[45] Jürgen Schmidhuber, et al. Unsupervised Learning in LSTM Recurrent Neural Networks, 2001, ICANN.

[46] C. Anton Rytting, et al. Preserving subsegmental variation in modeling word segmentation (or, the raising of baby Mondegreen), 2007.

[47] Morten H. Christiansen, et al. Learning to Segment Speech Using Multiple Cues: A Connectionist Model, 1998.

[48] M. D'Esposito. Working memory, 2008, Handbook of Clinical Neurology.

[49] Tatsuya Kawahara, et al. Learning a language model from continuous speech, 2010, INTERSPEECH.

[50] Aren Jansen, et al. A segmental framework for fully-unsupervised large-vocabulary speech recognition, 2016, Computer Speech & Language.

[51] Micha Elsner, et al. A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability, 2013, EMNLP.