Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home

We describe the structure and application of an acoustic room simulator used to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation times and signal-to-noise ratios, and a range of microphone and sound-source locations. Starting from a relatively clean training set, we artificially create simulated data by randomly sampling a noise configuration for every new training example; as a result, the acoustic model is trained on examples that are virtually never repeated. We evaluate the performance of this simulation-based approach using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM acoustic models for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is highly effective, yielding large improvements not only under simulated test conditions but also under real, rerecorded conditions. This room simulation system has been employed in training acoustic models, including those for the recently released Google Home.
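To make the per-utterance sampling concrete, the sketch below illustrates the general idea under stated assumptions: it is not the authors' simulator. The parameter ranges, the helper names (`sample_room_config`, `toy_rir`, `simulate_utterance`), and the exponentially decaying noise tail used as a crude stand-in for an image-method room impulse response are all hypothetical choices for illustration only.

```python
# Minimal sketch (not the paper's implementation): draw a fresh random room/noise
# configuration for every training example, reverberate the clean utterance with a
# toy RIR, and mix in noise at the sampled SNR. All ranges below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
FS = 16000  # sample rate (Hz), assumed


def sample_room_config():
    """Draw one random configuration; a new one is sampled per training example."""
    room_dim = rng.uniform([3.0, 3.0, 2.5], [10.0, 8.0, 4.0])  # W x L x H in meters
    return dict(
        room_dim=room_dim,
        rt60=rng.uniform(0.1, 0.9),                      # reverberation time (s)
        snr_db=rng.uniform(0.0, 20.0),                   # signal-to-noise ratio (dB)
        src_pos=rng.uniform([0.5] * 3, room_dim - 0.5),  # talker location
        mic_pos=rng.uniform([0.5] * 3, room_dim - 0.5),  # microphone location
    )


def toy_rir(rt60, length_s=0.5):
    """Exponentially decaying noise tail as a crude stand-in for an image-method RIR."""
    t = np.arange(int(length_s * FS)) / FS
    return rng.standard_normal(t.size) * np.exp(-6.9 * t / rt60)  # ~60 dB decay over rt60


def simulate_utterance(clean, noise, cfg):
    """Reverberate the clean utterance and add noise scaled to the sampled SNR."""
    reverberant = np.convolve(clean, toy_rir(cfg["rt60"]))[: len(clean)]
    noise = np.resize(noise, len(reverberant))
    sig_pow = np.mean(reverberant ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (cfg["snr_db"] / 10.0)))
    return reverberant + gain * noise


if __name__ == "__main__":
    clean = rng.standard_normal(FS)   # 1 s of placeholder "clean" audio
    noise = rng.standard_normal(FS)   # placeholder noise segment
    cfg = sample_room_config()        # resampled for every new example
    noisy = simulate_utterance(clean, noise, cfg)
    print(cfg["rt60"], cfg["snr_db"], noisy.shape)
```

Because the configuration is redrawn for each example, repeated epochs over the same clean corpus still present the network with effectively unique noisy utterances, which is the property the abstract highlights.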
