Reducing the Computational Complexity of Multimicrophone Acoustic Models with Integrated Feature Extraction

Recently, we presented a multichannel neural network model trained to perform speech enhancement jointly with acoustic modeling [1], directly from raw waveform input signals. While this model achieved over a 10% relative improvement compared to a single-channel model, it came at a large cost in computational complexity, particularly in the convolutions used to implement a time-domain filterbank. In this paper, we present several approaches to reduce the complexity of this model: reducing the stride of the convolution operation and implementing the filters in the frequency domain. These optimizations reduce the computational complexity of the model by a factor of 3 with no loss in accuracy on a 2,000-hour Voice Search task.
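As an illustrative sketch (not the paper's actual implementation), the appeal of moving the filterbank to the frequency domain can be seen from the standard convolution theorem: filtering a length-N frame with a length-K filter costs O(N·K) per filter in the time domain, but only O(N log N) via the FFT, while producing identical output. The array sizes below are arbitrary toy values.

```python
import numpy as np

# Toy waveform frame and time-domain filter (sizes are illustrative only).
rng = np.random.default_rng(0)
signal = rng.standard_normal(512)
filt = rng.standard_normal(64)

# Time-domain convolution, as in a raw-waveform front end: O(N * K) multiplies.
time_out = np.convolve(signal, filt)  # 'full' mode, length N + K - 1

# Equivalent frequency-domain filtering: zero-pad both to the full output
# length, multiply the spectra, and invert. Cost is O(N log N) per filter.
n = len(signal) + len(filt) - 1
freq_out = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(filt, n), n)

# The two paths agree to numerical precision.
assert np.allclose(time_out, freq_out)
```

Increasing the convolution stride gives a further, independent saving: with stride S, only every S-th output sample is computed, cutting the multiply count by roughly a factor of S.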

[1] Tara N. Sainath et al., "Learning the speech front-end with raw waveform CLDNNs," INTERSPEECH, 2015.

[2] Andreas Stolcke et al., "The SRI-ICSI Spring 2007 Meeting and Lecture Recognition System," CLEAR, 2007.

[3] B. D. Van Veen et al., "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, 1988.

[4] Tara N. Sainath et al., "Complex Linear Projection (CLP): A Discriminative Approach to Joint Feature Extraction and Acoustic Modeling," INTERSPEECH, 2016.

[5] Tara N. Sainath et al., "Factored spatial and spectral multichannel raw waveform CLDNNs," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[6] Boaz Rafaely et al., Microphone Array Signal Processing, 2008.

[7] Ron J. Weiss et al., "Speech acoustic modeling from raw multichannel waveforms," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.

[8] Michael S. Brandstein et al., Microphone Arrays: Signal Processing Techniques and Applications, 2001.

[9] Richard M. Stern et al., "Likelihood-maximizing beamforming for robust hands-free speech recognition," IEEE Transactions on Speech and Audio Processing, 2004.

[10] Tara N. Sainath et al., "Convolutional neural networks for small-footprint keyword spotting," INTERSPEECH, 2015.

[11] Tara N. Sainath et al., "Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms," 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015.

[12] Tara N. Sainath et al., "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.

[13] Tara N. Sainath et al., "Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks," INTERSPEECH, 2016.

[14] Lukás Burget et al., "Transcribing Meetings With the AMIDA Systems," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[15] Marc'Aurelio Ranzato et al., "Large Scale Distributed Deep Networks," NIPS, 2012.

[16] Masakiyo Fujimoto et al., "Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge," 2014.

[17] Georg Heigold et al., "Asynchronous stochastic optimization for sequence training of deep neural networks," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.