Factored spatial and spectral multichannel raw waveform CLDNNs

Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. Recently, we explored doing multichannel enhancement jointly with acoustic modeling, where beamforming and frequency decomposition was folded into one layer of the neural network [1, 2]. In this paper, we explore factoring these operations into separate layers in the network. Furthermore, we explore using multi-task learning (MTL) as a proxy for postfiltering, where we train the network to predict "clean" features as well as context-dependent states. We find that with the factored architecture, we can achieve a 10% relative improvement in WER over a single channel and a 5% relative improvement over the unfactored model from [1] on a 2,000-hour Voice Search task. In addition, by incorporating MTL, we can achieve 11% and 7% relative improvements over single channel and unfactored multichannel models, respectively.

[1]  Boaz Rafaely,et al.  Microphone Array Signal Processing , 2008 .

[2]  Tara N. Sainath,et al.  Learning the speech front-end with raw waveform CLDNNs , 2015, INTERSPEECH.

[3]  Ron J. Weiss,et al.  Speech acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Georg Heigold,et al.  Asynchronous stochastic optimization for sequence training of deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[6]  DeLiang Wang,et al.  Ideal ratio mask estimation using deep neural networks for robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Tara N. Sainath,et al.  Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[8]  Tara N. Sainath,et al.  Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Andreas Stolcke,et al.  The SRI-ICSI Spring 2007 Meeting and Lecture Recognition System , 2007, CLEAR.

[10]  B.D. Van Veen,et al.  Beamforming: a versatile approach to spatial filtering , 1988, IEEE ASSP Magazine.

[11]  S. King,et al.  In Proc. Interspeech , 2009 .

[12]  Jont B. Allen,et al.  Image method for efficiently simulating small‐room acoustics , 1976 .

[13]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[14]  R. Zelinski,et al.  A microphone array with adaptive post-filtering for noise reduction in reverberant rooms , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[15]  Jinyu Li,et al.  Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks. , 2013, ICLR 2013.

[16]  Lukás Burget,et al.  Transcribing Meetings With the AMIDA Systems , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Michael S. Brandstein,et al.  Microphone Arrays - Signal Processing Techniques and Applications , 2001, Microphone Arrays.

[18]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[19]  Masakiyo Fujimoto,et al.  LINEAR PREDICTION-BASED DEREVERBERATION WITH ADVANCED SPEECH ENHANCEMENT AND RECOGNITION TECHNOLOGIES FOR THE REVERB CHALLENGE , 2014 .

[20]  Richard M. Stern,et al.  Likelihood-maximizing beamforming for robust hands-free speech recognition , 2004, IEEE Transactions on Speech and Audio Processing.

[21]  Jasha Droppo,et al.  Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Quoc V. Le,et al.  Recurrent Neural Networks for Noise Reduction in Robust ASR , 2012, INTERSPEECH.