Automatic segmentation and clustering of speech using sparse coding and metaheuristic search

We propose a constrained shift- and scale-invariant sparse coding model for the unsupervised segmentation and clustering of speech into acoustically relevant sub-word units for automatic speech recognition (ASR). We introduce a novel local search algorithm that, starting from a random initialisation, iteratively improves the acoustic relevance of the automatically determined sub-word units by repeatedly aligning them with the training material and then re-estimating them. We also contribute an associated population-based metaheuristic optimisation procedure, related to genetic approaches, that performs a global search for the most acoustically relevant set of sub-word units. A first application of this metaheuristic search indicates that it yields an improvement over a corresponding local search. Using a subset of TIMIT for training, we also find that some of the automatically determined sub-word units in our final dictionaries exhibit a strong correlation with the reference phonetic transcriptions. Furthermore, in some cases our sub-word transcriptions yield a compact set of often-used pronunciations. Informal listening tests indicate that the clustering is robust, which gives us optimism that our approach will be suited to generating pronunciation dictionaries for ASR.
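The interplay between the local search (alignment followed by re-estimation) and the population-based global search can be illustrated with a deliberately simplified sketch. It is not the model described above: it assumes fixed-length segments, drops shift and scale invariance and the sparsity constraints, and uses squared error as a stand-in for acoustic relevance. The function names, the uniform crossover, and the elitist selection are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def assign(segments, units):
    """Alignment step: map each segment to its nearest unit; return labels and total squared error."""
    d = ((segments[:, None, :] - units[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
    labels = d.argmin(axis=1)
    return labels, d[np.arange(len(segments)), labels].sum()

def local_search(segments, units, n_iter=10):
    """Alternate alignment and re-estimation, starting from the given units."""
    units = units.copy()
    for _ in range(n_iter):
        labels, _ = assign(segments, units)
        for k in range(len(units)):
            members = segments[labels == k]
            if len(members):  # a unit with no aligned segments is left unchanged
                units[k] = members.mean(axis=0)
    _, err = assign(segments, units)
    return units, err

def global_search(segments, n_units, pop_size=6, n_gen=5, seed=0):
    """Population-based search: keep the best dictionaries, recombine them, refine the children locally."""
    rng = np.random.default_rng(seed)

    def random_units():
        idx = rng.choice(len(segments), size=n_units, replace=False)
        return segments[idx]

    pop = [local_search(segments, random_units()) for _ in range(pop_size)]
    history = []
    for _ in range(n_gen):
        pop.sort(key=lambda p: p[1])
        history.append(pop[0][1])
        elites = pop[: pop_size // 2]  # elitism: the best candidates always survive
        children = []
        while len(elites) + len(children) < pop_size:
            i, j = rng.choice(len(elites), size=2, replace=False)
            mask = rng.random(n_units) < 0.5  # uniform crossover over whole units (assumed operator)
            child = np.where(mask[:, None], elites[i][0], elites[j][0])
            children.append(local_search(segments, child))
        pop = elites + children
    pop.sort(key=lambda p: p[1])
    history.append(pop[0][1])
    return pop[0][0], history

# Toy data: 4 hypothetical "units" in an 8-dimensional feature space, 30 noisy segments each.
rng = np.random.default_rng(1)
centres = rng.normal(size=(4, 8))
segments = np.concatenate([c + 0.1 * rng.normal(size=(30, 8)) for c in centres])
units, history = global_search(segments, n_units=4)
```

Because the best candidates are carried over unchanged, the best error in `history` can never increase from one generation to the next; whether it actually decreases depends on the data and the random seed.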
