Sparse modeling of neural network posterior probabilities for exemplar-based speech recognition

Automatic speech recognition can be cast as a realization of compressive sensing.
Posterior probabilities are suitable features for exemplar-based sparse modeling.
Posterior-based sparse representation meets the statistical speech recognition formalism.
Dictionary learning reduces the size of the exemplar collection and improves performance.
Collaborative hierarchical sparsity exploits temporal information in continuous speech.

In this paper, a compressive sensing (CS) perspective on exemplar-based speech processing is proposed. Relying on an analytical relationship between the CS formulation and statistical speech recognition (hidden Markov models, HMMs), the automatic speech recognition (ASR) problem is cast as the recovery of a high-dimensional sparse word representation from the observed low-dimensional acoustic features. The acoustic features are exemplars obtained from (deep) neural network sub-word conditional posterior probabilities. Low-dimensional word manifolds are learned from these sub-word posterior exemplars and exploited to construct a linguistic dictionary for sparse representation of word posteriors. Dictionary learning is found to be a principled way to alleviate the need for the huge collection of exemplars required in conventional exemplar-based approaches, while still improving performance. Context appending and collaborative hierarchical sparsity are used to exploit the sequential and group structure underlying the sparse word representation. This formulation leads to a posterior-based sparse modeling approach to speech recognition. The potential of the proposed approach is demonstrated on isolated word (Phonebook corpus) and continuous speech (Numbers corpus) recognition tasks.
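
The sparse recovery idea summarized above can be illustrated with a minimal sketch. The abstract does not specify the solver, the dictionary sizes, or the decision rule, so everything below is an assumption made for illustration: the toy dictionary `D`, the atom-to-word labels `atom_words`, the helper names `ista_lasso` and `word_scores`, and the use of plain ISTA for an l1-regularized least-squares problem. It omits dictionary learning, context appending, and collaborative hierarchical sparsity.

```python
import numpy as np

def ista_lasso(D, z, lam=0.01, n_iter=500):
    """Approximately minimise 0.5*||z - D a||^2 + lam*||a||_1 with ISTA.

    Assumption: ISTA is used here only because it is short; the paper's
    actual sparse recovery algorithm is not stated in the abstract.
    """
    # Step size = 1 / Lipschitz constant of the gradient (squared spectral norm).
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - z)                   # gradient of the quadratic term
        u = a - step * grad                        # gradient step
        a = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft threshold
    return a

def word_scores(a, atom_words, n_words):
    """Sum the absolute activations of the atoms belonging to each word."""
    scores = np.zeros(n_words)
    np.add.at(scores, atom_words, np.abs(a))
    return scores

# Hypothetical toy dictionary: each column is a sub-word posterior exemplar
# (4 sub-word classes); columns 0-1 belong to word 0, columns 2-3 to word 1.
D = np.array([
    [0.70, 0.60, 0.10, 0.05],
    [0.20, 0.30, 0.10, 0.05],
    [0.05, 0.05, 0.60, 0.30],
    [0.05, 0.05, 0.20, 0.60],
])
atom_words = np.array([0, 0, 1, 1])

# Test posterior vector lying between the two exemplars of word 0.
z = np.array([0.65, 0.25, 0.05, 0.05])

a = ista_lasso(D, z)
scores = word_scores(a, atom_words, n_words=2)
predicted = int(np.argmax(scores))
print("sparse code:", np.round(a, 3))
print("word scores:", np.round(scores, 3), "predicted word:", predicted)
```

Grouping activations by word label is one simple way to turn a sparse code into a word-level decision; it is used here only to make the link between sparse representation and word posteriors concrete.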
