Unsupervised discovery of linguistic structure including two-level acoustic patterns using three cascaded stages of iterative optimization

Techniques for unsupervised discovery of acoustic patterns are increasingly attractive, because huge quantities of speech data are becoming available while manual annotations remain hard to acquire. In this paper, we propose an approach for unsupervised discovery of the linguistic structure of a target spoken language given only raw speech data. This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, the lexicon of word-like patterns in terms of subword-like patterns, and the N-gram language model based on word-like patterns. All patterns, models, and parameters are learned automatically from the unlabelled speech corpus. This is achieved by an initialization step followed by three cascaded stages of acoustic, linguistic, and lexical iterative optimization. The lexicon of word-like patterns defines the allowed sequences of consecutive HMMs for subword-like patterns. In each iteration, model training and decoding produce updated labels, from which the lexicon and HMMs can be further updated. In this way, model parameters and decoded labels are each optimized in every iteration, and knowledge about the linguistic structure is learned gradually, layer after layer. The proposed approach was tested in preliminary experiments on a corpus of Mandarin broadcast news, including a spoken term detection task whose performance was compared with a parallel test using models trained in a supervised way. Results show that the proposed system not only yields reasonable performance on its own, but is also complementary to existing large-vocabulary ASR systems.
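The alternation between model training and decoding described above can be illustrated with a deliberately simplified sketch. It is not the proposed system: each "pattern" is reduced to a single mean over one-dimensional frames instead of an HMM, decoding is nearest-mean assignment instead of lexicon- and N-gram-constrained decoding, and the function names (`train`, `decode`, `alternate`) are hypothetical. In effect this is a k-means-style hard-assignment loop, shown only to make the "train on labels, then re-decode labels" cycle concrete.

```python
def train(frames, labels, n_units):
    """Re-estimate one mean per unit from the frames currently labelled with it."""
    global_mean = sum(frames) / len(frames)
    means = []
    for u in range(n_units):
        assigned = [x for x, l in zip(frames, labels) if l == u]
        # Toy assumption: a unit with no frames falls back to the global mean.
        means.append(sum(assigned) / len(assigned) if assigned else global_mean)
    return means


def decode(frames, means):
    """Relabel every frame with the unit whose mean is closest."""
    return [min(range(len(means)), key=lambda u: (x - means[u]) ** 2)
            for x in frames]


def alternate(frames, n_units, init_labels, max_iters=20):
    """Alternate training and decoding until the labels stop changing."""
    labels = list(init_labels)
    means = train(frames, labels, n_units)
    for _ in range(max_iters):
        new_labels = decode(frames, means)
        if new_labels == labels:
            break  # labels are stable, so retraining would change nothing
        labels = new_labels
        means = train(frames, labels, n_units)
    return means, labels
```

Starting from an arbitrary initial labelling, for example `alternate([0.1, 0.2, 0.0, 5.0, 5.1, 4.9], 2, [0, 1, 0, 1, 0, 1])`, the loop regroups the frames into the two obvious clusters within a couple of iterations. The approach in the abstract applies the same kind of alternation with much richer models, and cascades it across the acoustic, linguistic, and lexical stages.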
