Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture that performs multichannel filtering in its first layer, and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank that computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.

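To make the first-layer operation described above concrete, the following is a minimal NumPy sketch of a multichannel filter-and-sum layer: each of P look directions has one short FIR filter per microphone channel, and the per-channel filter outputs are summed, mimicking filter-and-sum beamforming in the time domain. The function name multichannel_filter_and_sum, the shapes, and the random filters are illustrative assumptions, not the paper's trained parameters; in the actual model the filter taps are learned jointly with the acoustic model and the layer output is followed by pooling and a compressive nonlinearity.

import numpy as np

def multichannel_filter_and_sum(x, h):
    """Apply P multichannel FIR filters to a raw waveform.

    x: (C, T)    raw waveform, C microphone channels of T samples
    h: (P, C, N) P spatial filters, one N-tap FIR filter per channel
    returns: (P, T - N + 1) one filtered-and-summed output per look direction
    """
    P, C, N = h.shape
    T = x.shape[1]
    y = np.zeros((P, T - N + 1))
    for p in range(P):
        for c in range(C):
            # Filter channel c with its taps (keep only the 'valid' part) and
            # accumulate over channels: the time-domain analogue of
            # filter-and-sum beamforming.
            y[p] += np.convolve(x[c], h[p, c], mode="valid")
    return y

# Toy usage: 2 microphones, 10 look directions, 25 ms (400-tap) filters at 16 kHz.
x = np.random.randn(2, 16000)           # 1 s of 2-channel audio (placeholder data)
h = np.random.randn(10, 2, 400) * 0.01  # stand-in for learned filter taps
print(multichannel_filter_and_sum(x, h).shape)  # (10, 15601)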