A Direct Masking Approach to Robust ASR

Recently, much work has been devoted to the computation of binary masks for speech segregation. Conventional wisdom in the field of ASR holds that these binary masks cannot be used directly, because the missing energy significantly affects the calculation of the cepstral features commonly used in ASR. We show that this commonly held belief may be a misconception; we demonstrate the effectiveness of directly using the masked data on both a small-vocabulary and a large-vocabulary dataset. In fact, this approach, which we term the direct masking approach, performs comparably to two previously proposed missing feature techniques. We also investigate why other researchers may not have come to this conclusion; we find that variance normalization of the features is a significant factor in performance. This work suggests a much stronger baseline than unenhanced speech for future work in missing feature ASR.
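The following is a minimal sketch of the direct masking idea described in the abstract: a binary time-frequency mask is applied directly to the noisy spectrogram, cepstral features are computed from the masked data, and the features are then mean- and variance-normalized per utterance. The use of librosa, the specific parameter values, and the per-utterance normalization are illustrative assumptions; the paper's exact feature pipeline and recognizer configuration are not reproduced here.

```python
import numpy as np
import librosa


def direct_masking_mfcc(noisy, mask, sr=16000, n_fft=512, hop=160, n_mfcc=13):
    """Illustrative direct-masking front end (assumed parameters).

    noisy : 1-D waveform of noisy speech
    mask  : binary T-F mask with the same shape as the STFT of `noisy`
            (1 = speech-dominant unit, 0 = noise-dominant unit)
    """
    # STFT of the noisy speech
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)

    # Direct masking: zero out the noise-dominant time-frequency units
    masked_power = (np.abs(spec) ** 2) * mask

    # Standard MFCC pipeline on the masked spectrogram:
    # mel filterbank -> log compression -> DCT
    mel = librosa.feature.melspectrogram(S=masked_power, sr=sr, n_mels=26)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)

    # Per-utterance mean/variance normalization; the paper reports that
    # variance normalization is a significant factor in performance.
    mfcc -= mfcc.mean(axis=1, keepdims=True)
    mfcc /= mfcc.std(axis=1, keepdims=True) + 1e-8
    return mfcc
```

In practice the mask would come from an estimator (or the ideal binary mask in oracle experiments), and the normalized features would be fed to a conventional recognizer such as an HTK-based HMM system.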
