Joint sparse representation based cepstral-domain dereverberation for distant-talking speech recognition

This paper addresses the mismatch between training and testing conditions in distant-talking speech recognition under realistic reverberant environments. It is well known that the distortions caused by reverberation and background noise are highly nonlinear in the cepstral domain. We propose to capture the complex relationship between clean and reverberant speech via joint dictionary learning. Given a reverberant test utterance represented as a sequence of feature vectors, we first find the sparse representation of each vector and then estimate the underlying clean feature vectors using the dictionary of clean speech. In speech recognition experiments conducted under realistic reverberation conditions, the proposed method yields an average relative improvement of 59.1% compared with the baseline front-ends.
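The two-step estimate described above (sparse-code each reverberant frame, then reconstruct with the clean dictionary) can be illustrated with a minimal sketch. It assumes the coupled dictionaries `D_rev` and `D_clean` have already been learned so that paired atoms share a column index; the dictionary training itself is not shown. The sparse solver here is a simple orthogonal matching pursuit, swapped in for illustration and not necessarily the solver used in the paper. All names (`omp`, `dereverberate`, `n_nonzero`) and the toy data are hypothetical.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy orthogonal matching pursuit: approximate y with at most
    n_nonzero columns (atoms) of D. Illustrative solver only."""
    residual = y.copy()
    support = []
    sol = np.zeros(0)
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Select the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    coef[support] = sol
    return coef

def dereverberate(Y_rev, D_rev, D_clean, n_nonzero=3):
    """Code each reverberant frame (column of Y_rev) against D_rev, then
    reuse the same sparse codes with D_clean to estimate clean features."""
    codes = np.stack([omp(D_rev, y, n_nonzero) for y in Y_rev.T], axis=1)
    return D_clean @ codes, codes

# Toy data (assumption): 13-dimensional "cepstral" frames, orthonormal
# reverberant atoms, and random clean atoms sharing the same indices.
rng = np.random.default_rng(0)
dim, n_frames = 13, 5
D_rev, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
D_clean = rng.standard_normal((dim, dim))
true_codes = np.zeros((dim, n_frames))
for t in range(n_frames):
    idx = rng.choice(dim, size=2, replace=False)
    true_codes[idx, t] = rng.uniform(1.0, 2.0, size=2)
Y_rev = D_rev @ true_codes

X_hat, codes = dereverberate(Y_rev, D_rev, D_clean, n_nonzero=2)
```

The toy dictionary is orthonormal only so that the greedy solver recovers the codes exactly; a learned, overcomplete dictionary would generally give approximate codes, and the quality of the clean estimate then depends on how well the two dictionaries are coupled.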
