Paper information - Noise Robust Exemplar Matching Using Sparse Representations of Speech

Automatic speech recognition using exemplars (templates) promises better duration and coarticulation modeling than conventional approaches such as hidden Markov models (HMMs). Exemplars are spectrographic representations of speech segments extracted from the training data. Each exemplar is associated with a speech unit, e.g. a phone, syllable, half-word or word, and preserves the complete spectro-temporal content of that segment. Conventional exemplar-matching approaches to automatic speech recognition, such as those based on dynamic time warping, have typically been evaluated in clean conditions. In this paper, we propose a novel noise-robust exemplar matching framework for automatic speech recognition. The recognizer approximates noisy speech segments as a weighted sum of speech and noise exemplars and performs recognition by comparing the reconstruction errors of different classes with respect to a divergence measure. We evaluate the system on keyword recognition using the small vocabulary track of the 2nd CHiME Challenge and on connected digit recognition using the AURORA-2 database. The results show that the proposed system performs comparably to state-of-the-art noise-robust recognition systems.
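
The recognition principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each exemplar is a flattened non-negative feature vector, uses the generalized Kullback-Leibler divergence as the divergence measure, and fits the weights with the standard multiplicative update for that divergence. The function names, the toy class labels and the 4-dimensional data are all hypothetical.

```python
import math

EPS = 1e-12  # guards against division by zero and log(0)


def reconstruct(exemplars, weights):
    """Weighted sum of exemplars: sum_k weights[k] * exemplars[k]."""
    dim = len(exemplars[0])
    return [sum(w * ex[i] for w, ex in zip(weights, exemplars)) for i in range(dim)]


def kl_divergence(y, y_hat):
    """Generalized Kullback-Leibler divergence d(y || y_hat), non-negative inputs."""
    total = 0.0
    for yi, hi in zip(y, y_hat):
        hi = max(hi, EPS)
        if yi > 0:
            total += yi * math.log(yi / hi)
        total += hi - yi
    return total


def fit_weights(y, exemplars, iterations=200):
    """Non-negative weights via the multiplicative update for the KL divergence."""
    weights = [1.0] * len(exemplars)
    col_sums = [max(sum(ex), EPS) for ex in exemplars]
    for _ in range(iterations):
        y_hat = reconstruct(exemplars, weights)
        ratio = [yi / max(hi, EPS) for yi, hi in zip(y, y_hat)]
        weights = [
            w * sum(e * r for e, r in zip(ex, ratio)) / s
            for w, ex, s in zip(weights, exemplars, col_sums)
        ]
    return weights


def classify(y, speech_dicts, noise_exemplars):
    """Return (label, errors); label is the class with the lowest reconstruction error."""
    errors = {}
    for label, speech_exemplars in speech_dicts.items():
        # Each class dictionary combines that class's speech exemplars with the noise exemplars.
        dictionary = speech_exemplars + noise_exemplars
        weights = fit_weights(y, dictionary)
        errors[label] = kl_divergence(y, reconstruct(dictionary, weights))
    return min(errors, key=errors.get), errors


# Hypothetical toy data: two classes with two speech exemplars each, one noise exemplar.
speech_dicts = {
    "one": [[4.0, 1.0, 0.0, 0.0], [3.0, 2.0, 0.0, 0.0]],
    "two": [[0.0, 0.0, 4.0, 1.0], [0.0, 0.0, 3.0, 2.0]],
}
noise_exemplars = [[1.0, 1.0, 1.0, 1.0]]

# Noisy observation built as 0.5 * each "one" exemplar + 0.3 * the noise exemplar.
y = [3.8, 1.8, 0.3, 0.3]

label, errors = classify(y, speech_dicts, noise_exemplars)
print(label, errors)
```

Because the toy observation is constructed from the "one" exemplars plus noise, that class's dictionary can reconstruct it almost exactly, so its divergence is the smallest and "one" is selected. A real system would operate on spectrogram segments of varying length and on far larger dictionaries.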
