A Front-End Speech Enhancement System for Robust Automotive Speech Recognition

This paper presents a front-end speech enhancement approach to robust speech recognition in automotive environments. It combines model-based voice activity detection (VAD), relative transfer function (RTF) based generalized sidelobe cancellation, and single-channel post-filtering to enhance the speech signal of interest, thereby improving the robustness of speech recognition. First, we record training data in four typical driving scenarios that together cover most of the noise types encountered in automobiles. The recorded data are then used to train Gaussian mixture models (GMMs) for both speech and noise. The trained GMMs are subsequently used to estimate the speech presence probability on a frame-by-frame basis. This speech presence probability then serves as the basic information for RTF estimation, adaptive beamforming, and post-filtering. Experiments conducted in real automotive environments show that the developed method can significantly improve the performance of both VAD and automatic speech recognition (ASR).
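
The frame-by-frame speech presence probability described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature type, the number of mixture components, the covariance structure, or the prior, so the diagonal covariances, the `prior_speech` parameter, and the function names below are all assumptions made for illustration. Given one GMM for speech and one for noise, the sketch computes the posterior probability of speech for each feature frame using Bayes' rule.

```python
import numpy as np


def gmm_log_likelihood(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM (assumed form).

    frames:    (T, D) feature frames
    weights:   (K,)   mixture weights summing to 1
    means:     (K, D) component means
    variances: (K, D) component variances (diagonal covariances)
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm[None, :] - 0.5 * np.sum(
        diff ** 2 / variances[None, :, :], axis=2
    )                                                                  # (T, K)
    # Log-sum-exp over components, weighted by the mixture weights.
    return np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)


def speech_presence_probability(frames, speech_gmm, noise_gmm, prior_speech=0.5):
    """Posterior probability of speech for each frame.

    speech_gmm and noise_gmm are (weights, means, variances) tuples.
    prior_speech is a hypothetical prior; the source does not state one.
    """
    log_s = gmm_log_likelihood(frames, *speech_gmm) + np.log(prior_speech)
    log_n = gmm_log_likelihood(frames, *noise_gmm) + np.log(1.0 - prior_speech)
    # Bayes' rule evaluated in the log domain for numerical stability.
    return np.exp(log_s - np.logaddexp(log_s, log_n))
```

In a pipeline like the one outlined in the abstract, a per-frame probability of this kind could gate the downstream stages, for example by updating noise statistics only in frames where the probability is low. The exact way it is used for RTF estimation, beamforming, and post-filtering is not given in this section.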
