Multichannel feature enhancement in distributed microphone arrays for robust distant speech recognition in smart rooms

Room reverberation and environmental noise pose major challenges to the integration of speech recognition technology in smart-room applications. We present a multichannel enhancement framework for distributed microphone arrays that mitigates the effects of both additive noise and reverberation on distant-talking microphones. The proposed approach uses nonnegative matrix and tensor factorization to achieve both noise suppression (through sparse representation of speech spectra) and dereverberation (through decomposition of magnitude spectra into convolutive components). ASR experiments on the DIRHA-GRID corpus confirm that the proposed approach achieves relative improvements of up to 20% in recognition accuracy under highly reverberant and noisy conditions with clean-trained acoustic models.
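As an illustrative sketch only (not the paper's exact algorithm), the core building block behind the sparse-representation step is nonnegative matrix factorization of a magnitude spectrogram, V ≈ WH with all entries nonnegative. The classic multiplicative updates of Lee and Seung minimize the squared Euclidean distance; the toy data and rank below are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factor V (F x T) into nonnegative W (F x rank) and H (rank x T)
    using Lee-Seung multiplicative updates for squared Euclidean distance."""
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "magnitude spectrogram": an exactly rank-5 nonnegative matrix.
V = rng.random((20, 5)) @ rng.random((5, 30))
W, H = nmf(V, rank=5)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.4f}")
```

In enhancement settings, W is typically a dictionary of speech (and noise) spectral bases and H their sparse activations; suppression then amounts to reconstructing only the speech part of the decomposition.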
