Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection

Voice activity detection (VAD) is the task of predicting which parts of an utterance contains speech versus background noise. It is an important first step to determine which samples to send to the decoder and when to close the microphone. The long short-term memory neural network (LSTM) is a popular architecture for sequential modeling of acoustic signals, and has been successfully used in several VAD applications. However, it has been observed that LSTMs suffer from state saturation problems when the utterance is long (i.e., for voice dictation tasks), and thus requires the LSTM state to be periodically reset. In this paper, we propose an alternative architecture that does not suffer from saturation problems by modeling temporal variations through a stateless dilated convolution neural network (CNN). The proposed architecture differs from conventional CNNs in three respects: it uses dilated causal convolution, gated activations and residual connections. Results on a Google Voice Typing task shows that the proposed architecture achieves 14% relative FA improvement at a FR of 1% over state-of-the-art LSTMs for VAD task. We also include detailed experiments investigating the factors that distinguish the proposed architecture from conventional convolution.

[1]  Jürgen Schmidhuber,et al.  LSTM: A Search Space Odyssey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[2]  Matt Shannon,et al.  Improved End-of-Query Detection for Streaming Speech Recognition , 2017, INTERSPEECH.

[3]  Navdeep Jaitly,et al.  Speech recognition for medical conversations , 2017, INTERSPEECH.

[4]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Tara N. Sainath,et al.  Highway-LSTM and Recurrent Highway Networks for Speech Recognition , 2017, INTERSPEECH.

[6]  P. Dutilleux An Implementation of the “algorithme à trous” to Compute the Wavelet Transform , 1989 .

[7]  Björn W. Schuller,et al.  Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[10]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[11]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[12]  Tara N. Sainath,et al.  Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection , 2016, INTERSPEECH.

[13]  Brian Kingsbury,et al.  Improvements to the IBM speech activity detection system for the DARPA RATS program , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Tara N. Sainath,et al.  Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition , 2017, INTERSPEECH.

[15]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[16]  Fei Xie,et al.  A comparative study of speech detection methods , 1997, EUROSPEECH.

[17]  Yun Lei,et al.  All for one: feature combination for highly channel-degraded speech activity detection , 2013, INTERSPEECH.

[18]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[19]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[20]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[21]  Tara N. Sainath,et al.  Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home , 2017, INTERSPEECH.

[22]  Ph. Tchamitchian,et al.  Wavelets: Time-Frequency Methods and Phase Space , 1992 .

[23]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[24]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[25]  Geoffrey E. Hinton,et al.  A time-delay neural network architecture for isolated word recognition , 1990, Neural Networks.