Noise robust exemplar-based connected digit recognition

This paper proposes a noise robust exemplar-based speech recognition system where noisy speech is modeled as a linear combination of a set of speech and noise exemplars. The method works by finding a small number of labeled exemplars in a very large collection of speech and noise exemplars that jointly approximate the observed speech signal. We represent the exemplars using melenergies, which allows modeling the summation of speech and noise, and estimate the activations of the exemplars by minimizing the generalized Kullback-Leibler divergence between the observations and the model. The activations of the speech exemplars are directly being used for recognition. This approach proves to be promising, achieving up to 55.8% accuracy at signal-to-noise ratio −5 dB on the AURORA-2 connected digit recognition task.

[1]  Mikkel N. Schmidt,et al.  Linear Regression on Sparse Features for Single-Channel Speech Separation , 2007, 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[2]  Tuomas Virtanen,et al.  Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[4]  Tuomas Virtanen,et al.  Spectral covariance in prior distributions of non-negative matrix factorization based speech separation , 2009, 2009 17th European Signal Processing Conference.

[5]  Simon J. Godsill,et al.  Bayesian extensions to non-negative matrix factorisation for audio signal modelling , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  Allen Y. Yang,et al.  Robust Face Recognition via Sparse Representation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Patrick Wambacq,et al.  Template-Based Continuous Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Li Deng,et al.  Structure-based and template-based automatic speech recognition - comparing parametric and non-parametric approaches , 2007, INTERSPEECH.

[9]  Hynek Hermansky,et al.  Towards increasing speech recognition error rates , 1995, Speech Commun..

[10]  Hugo Van hamme,et al.  Robust speech recognition using missing data techniques in the prospect domain and fuzzy masks , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  S. Goldinger Echoes of echoes? An episodic theory of lexical access. , 1998, Psychological review.

[12]  B. Cranen,et al.  Noise robust digit recognition using sparse representations , 2008 .

[13]  P. Smaragdis,et al.  Non-negative matrix factorization for polyphonic music transcription , 2003, 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684).

[14]  Louis ten Bosch,et al.  Using sparse representations for exemplar based continuous digit recognition , 2009, 2009 17th European Signal Processing Conference.

[15]  R.M. Stern,et al.  Missing-feature approaches in speech recognition , 2005, IEEE Signal Processing Magazine.