Combining exemplar-based matching and exemplar-based sparse representations of speech

In this paper, we compare two different frameworks for exemplar-based speech recognition and propose a combined system that approximates the input speech as a linear combination of exemplars of variable length. This approach allows us not only to use long exemplars of multiple lengths, each representing a certain speech unit, but also to jointly approximate input speech segments using several exemplars. While such an approach is able to model noisy speech, it also enforces a feature representation in which the effects of the signal sources are additive. This is observed to limit the recognition accuracy compared to, e.g., discriminatively trained representations. To identify the main causes of recognition errors, we trace the system performance from a baseline single-neighbor exemplar matching system using discriminative features through to the proposed combined system. Even though the proposed approach has a lower recognition accuracy than the baseline, it significantly outperforms the intermediate systems using comparable features.
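To make the idea of approximating a speech segment as a linear combination of exemplars concrete, the following is a minimal sketch and not the system described in the paper. It assumes non-negative features, a Euclidean reconstruction cost, and a standard multiplicative update with an L1 sparsity penalty; the abstract does not state the cost function or solver, so these choices, along with the function name `sparse_activations` and its `sparsity` parameter, are illustrative assumptions. Exemplars of different lengths would, under this sketch, be handled by running it once per length with a dictionary of matching size.

```python
def sparse_activations(y, exemplars, sparsity=0.0, iterations=200, eps=1e-12):
    """Estimate non-negative weights x with sum_k x[k] * exemplars[k] ~= y.

    y         : non-negative feature values of one (stacked) input segment
    exemplars : list of exemplars, each the same length as y
    sparsity  : L1 penalty weight (hypothetical parameter; 0 disables it)
    """
    n_ex = len(exemplars)
    dim = len(y)
    x = [1.0] * n_ex  # strictly positive start so updates can move every weight

    # A^T y does not change between iterations, so compute it once.
    cross = [sum(exemplars[k][d] * y[d] for d in range(dim)) for k in range(n_ex)]

    for _ in range(iterations):
        # Current reconstruction A x of the input segment.
        approx = [sum(x[k] * exemplars[k][d] for k in range(n_ex))
                  for d in range(dim)]
        # Multiplicative update (Euclidean cost, assumed here):
        #   x <- x * (A^T y) / (A^T A x + sparsity)
        # All weights are updated from the same reconstruction.
        new_x = []
        for k in range(n_ex):
            denom = sum(exemplars[k][d] * approx[d] for d in range(dim))
            new_x.append(x[k] * cross[k] / (denom + sparsity + eps))
        x = new_x
    return x


if __name__ == "__main__":
    # Toy dictionary of two exemplars; the segment is 1 * first + 2 * second.
    dictionary = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
    segment = [1.0, 3.0, 2.0]
    print(sparse_activations(segment, dictionary))
```

Because the update only ever multiplies a weight by a non-negative ratio, the weights stay non-negative, which is what ties this kind of model to features in which source contributions add up.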
