Noise Robust Exemplar Matching Using Sparse Representations of Speech

Automatic speech recognition using exemplars (templates) promises better duration and coarticulation modeling than conventional approaches such as hidden Markov models (HMMs). Exemplars are spectrographic representations of speech segments extracted from the training data. Each exemplar is associated with a speech unit (e.g., a phone, syllable, half-word, or word) and preserves the complete spectro-temporal content of the speech. Conventional exemplar-matching approaches to automatic speech recognition, such as those based on dynamic time warping, have typically been evaluated in clean conditions. In this paper, we propose a novel noise robust exemplar matching framework for automatic speech recognition. The recognizer approximates noisy speech segments as a weighted sum of speech and noise exemplars and performs recognition by comparing the reconstruction errors of different classes with respect to a divergence measure. We evaluate the system on keyword recognition using the small vocabulary track of the 2nd CHiME Challenge and on connected digit recognition using the AURORA-2 database. The results show that the proposed system achieves results comparable to those of state-of-the-art noise robust recognition systems.
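
The recognition principle described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the generalized Kullback-Leibler divergence, the multiplicative update rule, the toy 4-dimensional exemplars, and all function names (`fit_weights`, `classify`, and so on) are assumptions chosen for illustration, and the sparsity constraint implied by the title is omitted for brevity. Each class dictionary is extended with shared noise exemplars, non-negative weights are fitted to the noisy observation, and the class with the smallest reconstruction error is selected.

```python
import math

EPS = 1e-12  # guards against division by zero and log(0)


def reconstruct(dictionary, weights):
    """Weighted sum of exemplars; each exemplar is a flat non-negative vector."""
    dim = len(dictionary[0])
    return [sum(w * atom[i] for w, atom in zip(weights, dictionary))
            for i in range(dim)]


def kl_divergence(y, y_hat):
    """Generalized Kullback-Leibler divergence d(y || y_hat) (assumed measure)."""
    total = 0.0
    for yi, hi in zip(y, y_hat):
        hi = max(hi, EPS)
        if yi > 0:
            total += yi * math.log(yi / hi)
        total += hi - yi
    return total


def fit_weights(y, dictionary, iterations=200):
    """Non-negative weights via multiplicative updates for the KL divergence."""
    weights = [1.0] * len(dictionary)
    atom_sums = [max(sum(atom), EPS) for atom in dictionary]
    for _ in range(iterations):
        y_hat = reconstruct(dictionary, weights)
        ratio = [yi / max(hi, EPS) for yi, hi in zip(y, y_hat)]
        weights = [w * sum(a * r for a, r in zip(atom, ratio)) / s
                   for w, atom, s in zip(weights, dictionary, atom_sums)]
    return weights


def classify(y, speech_dicts, noise_exemplars):
    """Return the class whose speech + noise exemplars reconstruct y best."""
    errors = {}
    for label, exemplars in speech_dicts.items():
        dictionary = exemplars + noise_exemplars
        weights = fit_weights(y, dictionary)
        errors[label] = kl_divergence(y, reconstruct(dictionary, weights))
    return min(errors, key=errors.get), errors


# Toy data (hypothetical): two classes with two exemplars each, one noise exemplar.
speech_dicts = {
    "one": [[1.0, 0.0, 0.0, 1.0], [2.0, 1.0, 0.0, 0.0]],
    "two": [[0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 2.0, 1.0]],
}
noise_exemplars = [[0.5, 0.5, 0.5, 0.5]]

# Noisy observation: 1.0 * first "one" exemplar + 0.5 * second + 1.0 * noise.
noisy = [2.5, 1.0, 0.5, 1.5]
label, errors = classify(noisy, speech_dicts, noise_exemplars)
print(label, errors)
```

In this toy case the observation lies exactly in the span of the "one" exemplars plus noise, so that class yields the lower divergence. A practical system would operate on spectrogram segments of varying duration and would typically add a sparsity penalty on the weights.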
