Mask estimation and imputation methods for missing data speech recognition in a multisource reverberant environment

We present an automatic speech recognition system that uses a missing data approach to compensate for challenging environmental noise containing both additive and convolutive components. The unreliable, noise-corrupted ("missing") components are identified using a Gaussian mixture model (GMM) classifier based on a diverse range of acoustic features. To perform speech recognition on the partially observed data, the missing components are substituted with clean speech estimates computed using both sparse imputation and cluster-based GMM imputation. Compared to two reference mask estimation techniques based on interaural level and time differences, the proposed missing data approach significantly improved keyword accuracy rates in all signal-to-noise ratio conditions when evaluated on the CHiME reverberant multisource environment corpus. Of the two imputation methods, cluster-based imputation outperformed sparse imputation. The highest keyword accuracy was achieved when the system was trained on imputed data, which made it more robust to possible imputation errors.
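The cluster-based GMM imputation mentioned above can be illustrated with a minimal sketch. The idea is to fit a GMM to clean speech features and replace each missing spectral component with its conditional expectation given the reliable components. The sketch below assumes diagonal covariances, in which case the conditional mean of the missing dimensions reduces to a posterior-weighted sum of component means; the function name `gmm_impute` and the use of scikit-learn's `GaussianMixture` are illustrative choices, not the paper's actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_impute(x, mask, gmm):
    """Replace missing components of feature vector x with their
    conditional mean under a diagonal-covariance GMM of clean speech.

    x    : (D,) noisy feature vector
    mask : (D,) boolean, True where the component is reliable
    gmm  : fitted sklearn GaussianMixture with covariance_type='diag'
    """
    means, variances, weights = gmm.means_, gmm.covariances_, gmm.weights_

    # Posterior p(k | x_observed): evaluate each component's Gaussian
    # likelihood on the reliable dimensions only.
    log_p = np.log(weights).copy()
    for k in range(len(weights)):
        d = x[mask] - means[k, mask]
        log_p[k] += -0.5 * np.sum(
            d ** 2 / variances[k, mask] + np.log(2 * np.pi * variances[k, mask])
        )
    log_p -= log_p.max()            # stabilize before exponentiating
    post = np.exp(log_p)
    post /= post.sum()

    # With diagonal covariances, E[x_missing | x_observed] is simply the
    # posterior-weighted average of the component means on missing dims.
    x_hat = x.copy()
    x_hat[~mask] = post @ means[:, ~mask]
    return x_hat
```

In a full missing-data recognizer the mask itself would come from the GMM-based mask classifier described in the abstract; here it is supplied directly. Bounded variants additionally constrain the estimate to lie below the observed noisy value, which this sketch omits for brevity.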
