On using heterogeneous data for vehicle-based speech recognition: A DNN-based approach

Most automatic speech recognition (ASR) systems incorporate a single source of information about their input, namely, features and transformations derived from the speech signal. However, in many applications, e.g., vehicle-based speech recognition, sensor data and environmental information are often available to complement the audio. In this paper, we show how these data can be used to improve hybrid DNN-HMM ASR systems on a vehicle-based speech recognition task. Feature fusion is accomplished by augmenting the acoustic features with additional side information before they are presented to the DNN acoustic model. The additional features are extracted from the vehicle speed, HVAC status, windshield wiper status, and vehicle type. This supplementary information improves the DNN's ability to discriminate phonetic events in an environment-aware way, without requiring any modification to the DNN training algorithms. Experimental results show that heterogeneous data are effective irrespective of whether cross-entropy (CE) or sequence training is used: CE training yields a WER reduction of 6.3%, while sequence training yields a reduction of 5.5%.
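
Below is a minimal sketch of the feature-fusion idea described in the abstract. It assumes a hypothetical helper (`augment_with_side_info`), a speed-normalization constant, and a one-hot vehicle-type encoding; the paper's actual front end and feature dimensions are not specified here.

```python
import numpy as np

def augment_with_side_info(acoustic_frames, speed_kmh, hvac_on,
                           wipers_on, vehicle_type_id, num_vehicle_types):
    """Append environment features to every acoustic frame.

    acoustic_frames: (T, D) array of per-frame acoustic features
    (e.g., log-mel filterbanks, possibly with context splicing).
    The side information is constant over an utterance, so it is
    tiled along the time axis before concatenation.
    """
    # Normalize the continuous signal (130 km/h is an assumed scale)
    # and binarize the status flags.
    side = [speed_kmh / 130.0, float(hvac_on), float(wipers_on)]
    # One-hot encode the vehicle type (a hypothetical categorical field).
    one_hot = [0.0] * num_vehicle_types
    one_hot[vehicle_type_id] = 1.0
    side = np.array(side + one_hot, dtype=np.float32)

    T = acoustic_frames.shape[0]
    tiled = np.tile(side, (T, 1))                              # (T, S)
    return np.concatenate([acoustic_frames, tiled], axis=1)    # (T, D + S)

# Example: 300 frames of 40-dim features, vehicle at 80 km/h with HVAC on.
frames = np.random.randn(300, 40).astype(np.float32)
fused = augment_with_side_info(frames, speed_kmh=80.0, hvac_on=True,
                               wipers_on=False, vehicle_type_id=2,
                               num_vehicle_types=4)
print(fused.shape)  # (300, 47) -- fed to the DNN acoustic model as usual
```

Because the fusion happens purely at the input layer, the DNN training recipe (CE or sequence training) is unchanged, which is the property the abstract emphasizes.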
