Coupled Dictionaries for Exemplar-Based Speech Enhancement and Automatic Speech Recognition

Exemplar-based speech enhancement systems decompose noisy speech as a weighted sum of speech and noise exemplars stored in a dictionary, and use the resulting speech and noise estimates to derive a time-varying filter in the full-resolution frequency domain. To obtain the decomposition, exemplars sampled in lower-dimensional spaces are preferred over full-resolution ones for their reduced computational complexity and their better generalization to unseen cases. However, the resulting filter may be sub-optimal, as mapping the obtained speech and noise estimates back to the full-resolution frequency domain yields only a low-rank approximation. This paper proposes an efficient way to compute the full-resolution frequency estimates of speech and noise directly, using coupled dictionaries: an input dictionary containing atoms from the desired exemplar space, used to obtain the decomposition, and a coupled output dictionary containing the corresponding exemplars in the full-resolution frequency domain. We also introduce modulation spectrogram features for the exemplar-based tasks using this approach. The proposed system was evaluated for various choices of input exemplars and yielded improved speech enhancement performance on the AURORA-2 and AURORA-4 databases. We further show that the proposed approach also reduces word error rates (WERs) in speech recognition tasks using HMM-GMM and deep neural network (DNN) based systems.
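The coupled-dictionary idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the dimensions, the KL-divergence multiplicative update, the sparsity weight, and the random "exemplars" are all hypothetical placeholders. Activations are estimated on a low-dimensional input dictionary and then applied, unchanged, to the coupled full-resolution output dictionary to build a Wiener-like gain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: low-dimensional input space (e.g. mel) vs
# full-resolution frequency domain; speech and noise exemplar counts.
D_in, D_out, K_speech, K_noise = 40, 257, 50, 20
K = K_speech + K_noise

# Coupled dictionaries: column k of A_in and A_out describe the SAME
# exemplar, sampled in the two different feature spaces.
A_in = rng.random((D_in, K)) + 1e-9    # input dictionary (decomposition)
A_out = rng.random((D_out, K)) + 1e-9  # output dictionary (full resolution)

y_in = rng.random(D_in) + 1e-9         # noisy observation, input space
y_out = rng.random(D_out) + 1e-9       # noisy full-resolution magnitudes

x = np.ones(K)                         # nonnegative activations
lam = 0.1                              # sparsity penalty weight

# Multiplicative updates minimizing KL(y_in || A_in @ x) + lam * sum(x);
# they keep x nonnegative by construction.
for _ in range(200):
    approx = A_in @ x
    x *= (A_in.T @ (y_in / approx)) / (A_in.sum(axis=0) + lam)

# Map the SAME activations through the coupled output dictionary to get
# full-resolution speech and noise estimates directly.
s_hat = A_out[:, :K_speech] @ x[:K_speech]
n_hat = A_out[:, K_speech:] @ x[K_speech:]

# Wiener-like time-varying filter applied to the noisy frame.
gain = s_hat / (s_hat + n_hat + 1e-12)
enhanced = gain * y_out
```

Because the activations are computed once in the cheap input space and only the final matrix product touches the full-resolution dictionary, the per-frame cost stays close to that of the low-dimensional decomposition.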
