Coupling binary masking and robust ASR

We present a novel framework for performing speech separation and robust automatic speech recognition (ASR) in a unified fashion. Separation is performed by estimating the ideal binary mask (IBM), which identifies speech-dominant and noise-dominant units in a time-frequency (T-F) representation of the noisy signal. ASR is performed on cepstral features extracted after binary masking. Previous systems perform these steps sequentially: separation followed by recognition. The proposed framework, which we call bidirectional speech decoding (BSD), unifies the two stages by using multiple IBM estimators, each designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On the Aurora-4 robust ASR task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also yields significant improvements in the quality of the estimated IBM.
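For reference, the IBM labels a T-F unit as speech-dominant when its local SNR exceeds a local criterion (LC), commonly set to 0 dB. The sketch below is a minimal Python illustration of this definition, assuming premixed speech and noise magnitude spectrograms are available (an oracle setting; in practice the mask must be estimated from the mixture). The function name, variable names, and the 0 dB default are our illustrative choices, not taken from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, lc_db=0.0):
    """Compute the ideal binary mask (IBM) from premixed speech and noise
    magnitude spectrograms of identical shape (frequency x time).

    A T-F unit is marked speech-dominant (1) when its local SNR exceeds
    the local criterion lc_db (in dB); otherwise it is noise-dominant (0).
    """
    eps = np.finfo(float).eps  # avoid log of zero / division by zero
    # Magnitude ratio squared is the power ratio, so 20*log10 gives SNR in dB.
    local_snr_db = 20.0 * np.log10((speech_spec + eps) / (noise_spec + eps))
    return (local_snr_db > lc_db).astype(np.uint8)

# Usage sketch: apply the mask to a noisy spectrogram before extracting
# cepstral features. speech_spec and noise_spec would come from the
# premixed signals in oracle experiments only.
```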
