Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings

In settings where only unlabeled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modeling text. A similar problem is faced when modeling infant language acquisition. In these cases, categorical linguistic structure needs to be discovered directly from speech audio. We present a novel unsupervised Bayesian model that segments unlabeled speech and clusters the segments into hypothesized word groupings. The result is a complete unsupervised tokenization of the input speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional acoustic vector space. The model, implemented as a Gibbs sampler, then builds a whole-word acoustic model in this space while jointly performing segmentation. We report word error rates in a small-vocabulary connected digit recognition task by mapping the unsupervised decoded output to ground truth transcriptions. The model achieves a word error rate of around 20%, outperforming a previous HMM-based system by about 10% absolute. Moreover, in contrast to the baseline, our model does not require a pre-specified vocabulary size.
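
To make the idea of a fixed-dimensional embedding concrete, the sketch below maps segments of different lengths to vectors of the same size by keeping a few evenly spaced frames and concatenating them. This is only an illustration under stated assumptions: the function name `downsample_embed`, the parameter `n_keep`, and the uniform downsampling scheme itself are hypothetical choices and are not claimed to be the embedding method used in the paper.

```python
from typing import List, Sequence


def downsample_embed(frames: Sequence[Sequence[float]], n_keep: int = 3) -> List[float]:
    """Map a variable-length segment (a list of feature frames) to a fixed-length vector.

    Illustrative assumption: keep n_keep frames at evenly spaced positions and
    concatenate them. Not necessarily the embedding used in the paper.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    n_frames = len(frames)
    embedding: List[float] = []
    for k in range(n_keep):
        # Take the frame nearest the centre of the k-th of n_keep equal-width bins.
        idx = min(n_frames - 1, int((k + 0.5) * n_frames / n_keep))
        embedding.extend(frames[idx])
    return embedding


# Two toy segments with 2-dimensional frames and different durations.
short_segment = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]  # 4 frames
long_segment = [[float(t), float(-t)] for t in range(10)]         # 10 frames

short_vec = downsample_embed(short_segment)
long_vec = downsample_embed(long_segment)

# Both vectors have n_keep * frame_dim = 6 entries regardless of segment length.
print(len(short_vec), len(long_vec))
```

Because every candidate segment ends up in the same vector space, segments of any duration can be compared and clustered directly, which is what allows a whole-word acoustic model to be built over the embeddings.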
