Noise-robust digit recognition with exemplar-based sparse representations of variable length

This paper introduces an exemplar-based noise-robust digit recognition system in which noisy speech is modeled as a sparse linear combination of clean speech and noise exemplars. Exemplars are rigid long speech units of different lengths, i.e. no warping mechanism is used for exemplar matching to avoid poor time alignments that would otherwise be provoked by the noise and the natural duration distribution of each unit in the training data is preserved. Speech and noise separation is performed by applying non-negative sparse coding using a separate exemplar dictionary for each labeled unit (in this case half-digits) rather than a single dictionary of all units. This approach does not only provide better classification of speech units but also models the temporal structure of speech and noise more accurately. The system performance is evaluated on the AURORA-2 database. The results show that the proposed system performs significantly better than a comparable system using a single dictionary at positive SNR levels.

[1]  Tara N. Sainath,et al.  An analysis of sparseness and regularization in exemplar-based methods for speech classification , 2010, INTERSPEECH.

[2]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[3]  Mark J. F. Gales,et al.  Robust continuous speech recognition using parallel model combination , 1996, IEEE Trans. Speech Audio Process..

[4]  Yifan Gong,et al.  Speech recognition in noisy environments: A survey , 1995, Speech Commun..

[5]  Patrick Wambacq,et al.  Template-Based Continuous Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Paris Smaragdis,et al.  A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Tuomas Virtanen,et al.  Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Tuomas Virtanen,et al.  Exemplar-Based Sparse Representations for Noise Robust Automatic Speech Recognition , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Danny Crookes,et al.  A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Tuomas Virtanen,et al.  Noise robust exemplar-based connected digit recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  R.M. Stern,et al.  Missing-feature approaches in speech recognition , 2005, IEEE Signal Processing Magazine.

[12]  Louis ten Bosch,et al.  Using sparse representations for exemplar based continuous digit recognition , 2009, 2009 17th European Signal Processing Conference.

[13]  Douglas D. O'Shaughnessy,et al.  Context-independent phoneme recognition using a K-Nearest Neighbour classification approach , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  Tuomas Virtanen,et al.  Non-negative matrix deconvolution in noise robust speech recognition , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).