Unsupervised Word Discovery from Phonetic Input Using Nested Pitman-Yor Language Modeling

In this paper we consider unsupervised word discovery from phonetic input. We employ a word segmentation algorithm that simultaneously develops a lexicon, i.e., the transcription of each word as a phone sequence, learns an n-gram language model describing word and word-sequence probabilities, and carries out the segmentation itself. The underlying statistical model is the Pitman-Yor process, a concept from Bayesian non-parametrics that allows for an a priori unknown and unlimited number of different words. A hierarchy of Pitman-Yor processes supports language models of different orders, and nesting it with a second hierarchy of Pitman-Yor processes at the phone level allows unknown word unigrams to be backed off to phone m-grams. We present results on a large-vocabulary task, assuming an error-free phone sequence is given, and conclude by discussing options for coping with noisy phone sequences.
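
To make the backoff structure concrete, here is a minimal sketch of the predictive rule of a hierarchical Pitman-Yor language model in the standard Chinese-restaurant notation (following Teh, 2006); the symbols below are our shorthand and are not taken from the paper itself:

P(w \mid \mathbf{u}) \;=\; \frac{c_{\mathbf{u}w} - d_{|\mathbf{u}|}\, t_{\mathbf{u}w}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot}} \;+\; \frac{\theta_{|\mathbf{u}|} + d_{|\mathbf{u}|}\, t_{\mathbf{u}\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot}}\; P\bigl(w \mid \pi(\mathbf{u})\bigr)

Here \mathbf{u} is the word context, \pi(\mathbf{u}) the context with its earliest word removed, c_{\mathbf{u}w} and t_{\mathbf{u}w} are customer and table counts, and d_{|\mathbf{u}|}, \theta_{|\mathbf{u}|} are the discount and strength parameters for contexts of length |\mathbf{u}|. In the nested model, the base distribution reached at the word-unigram level is itself a phone-level hierarchical Pitman-Yor model, which is what allows previously unseen words to receive non-zero probability via phone m-grams.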
