Embedding time warping in exemplar-based sparse representations of speech

This paper describes a new sparse representation model for speech that allows time warping, as an extension to a recently proposed speech recognition system based on sparse representations. This recognition system models the acoustics with exemplars, which are labeled speech occurrences of different lengths extracted from the training data. Exemplars are organized into multiple dictionaries according to their class and length. Input speech segments are approximated as a sparse linear combination of the exemplars in these dictionaries, and reconstruction error-based decoding is used to find the best-matching class sequence. The current sparse representation model approximates an input speech segment using only a dictionary and a weight vector, so an input segment cannot be compared with exemplars of a different length. The goal of this work is to introduce a novel sparse representation model that allows time warping by means of a third matrix, which linearly combines consecutive frames in order to shrink or expand the approximation. Preliminary results show the feasibility of the proposed sparse representation model.
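
The idea of a third matrix that linearly combines consecutive frames can be illustrated with a small NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the warping matrix is a fixed linear-interpolation matrix, the weights are non-negative and estimated with a Euclidean cost plus an L1 penalty using multiplicative updates, and the names `linear_warp_matrix` and `sparse_nonneg_weights` are hypothetical. The paper's own cost function, constraints, and solver may differ.

```python
import numpy as np


def linear_warp_matrix(n_in, n_out):
    """Build an (n_in x n_out) warping matrix (hypothetical helper).

    Each column mixes at most two consecutive input frames by linear
    interpolation, so multiplying a (features x n_in) exemplar by this
    matrix shrinks or expands it to n_out frames.
    """
    W = np.zeros((n_in, n_out))
    pos = np.linspace(0.0, n_in - 1, n_out)   # sampling points on the input axis
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_in - 1)
    frac = pos - lo
    for j in range(n_out):
        W[lo[j], j] += 1.0 - frac[j]
        W[hi[j], j] += frac[j]
    return W


def sparse_nonneg_weights(y, B, lam=0.1, n_iter=200, eps=1e-12):
    """Assumed solver: multiplicative updates for
    0.5 * ||y - B x||^2 + lam * sum(x), with x >= 0 and non-negative y, B."""
    x = np.ones(B.shape[1])
    Bty = B.T @ y
    BtB = B.T @ B
    for _ in range(n_iter):
        x *= Bty / (BtB @ x + lam + eps)
    return x


# Toy dictionary: three 4-frame exemplars, each with energy in one feature band.
n_feat, n_ex_frames, n_in_frames = 3, 4, 7
ramp = np.arange(1.0, n_ex_frames + 1)
exemplars = []
for k in range(n_feat):
    A_k = np.zeros((n_feat, n_ex_frames))
    A_k[k, :] = ramp
    exemplars.append(A_k)

# The input is a 7-frame stretched version of exemplar 1.
W = linear_warp_matrix(n_ex_frames, n_in_frames)
Y = exemplars[1] @ W

# Warp every exemplar to the input length, then vectorize it as a dictionary column.
B = np.stack([(A_k @ W).ravel() for A_k in exemplars], axis=1)
x = sparse_nonneg_weights(Y.ravel(), B)

# Reconstruction error per exemplar; the smallest error picks the match.
errors = [np.linalg.norm(Y - x[k] * (A_k @ W)) for k, A_k in enumerate(exemplars)]
best = int(np.argmin(errors))
```

In this toy setup the 4-frame exemplars can be compared with a 7-frame input only because `W` maps them to the input length first. Without the warping matrix, the dictionary columns and the input vector would have different dimensions and the approximation would not be defined.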
