Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems that rely on audio-only features are often impaired by environmental noise. An alternative approach to address this problem is audiovisual VAD (AV-VAD). Modeling the timing dependencies between acoustic and visual features is a key challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) that combines audiovisual features in a principled, unified framework, capturing the timing dependencies both within and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network, and the outputs of these networks serve as the input to a third BLSTM network that fuses the two modalities. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., a high-definition (HD) camera and a close-talk microphone), and by 18.5% (absolute) under more challenging conditions (i.e., the camera and microphone of a tablet, with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
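The fusion scheme described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the layer sizes, random initialization, and the frame-level logistic output head are illustrative assumptions; a real system would use a deep-learning framework and trained weights. The sketch only shows the data flow: one BLSTM per modality, whose per-frame outputs are concatenated and fed to a fusion BLSTM.

```python
import math
import random

random.seed(7)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(a, b):
    return [u + v for u, v in zip(a, b)]

class LSTM:
    """Minimal unidirectional LSTM with random (untrained) weights."""
    def __init__(self, d_in, d_hid):
        self.d_hid = d_hid
        rnd = lambda r, c: [[random.uniform(-0.1, 0.1) for _ in range(c)]
                            for _ in range(r)]
        # One input-to-hidden and hidden-to-hidden matrix per gate (i, f, c, o).
        self.Wx = {g: rnd(d_hid, d_in) for g in "ifco"}
        self.Wh = {g: rnd(d_hid, d_hid) for g in "ifco"}

    def forward(self, xs):
        h, c, out = [0.0] * self.d_hid, [0.0] * self.d_hid, []
        for x in xs:
            i = [sigmoid(v) for v in vadd(matvec(self.Wx["i"], x), matvec(self.Wh["i"], h))]
            f = [sigmoid(v) for v in vadd(matvec(self.Wx["f"], x), matvec(self.Wh["f"], h))]
            g = [math.tanh(v) for v in vadd(matvec(self.Wx["c"], x), matvec(self.Wh["c"], h))]
            o = [sigmoid(v) for v in vadd(matvec(self.Wx["o"], x), matvec(self.Wh["o"], h))]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            out.append(h)
        return out

class BLSTM:
    """Bidirectional wrapper: concatenate forward and backward hidden states."""
    def __init__(self, d_in, d_hid):
        self.fwd, self.bwd = LSTM(d_in, d_hid), LSTM(d_in, d_hid)

    def forward(self, xs):
        hf = self.fwd.forward(xs)
        hb = self.bwd.forward(list(reversed(xs)))[::-1]
        return [a + b for a, b in zip(hf, hb)]  # per-frame list concatenation

class BimodalRNN:
    """Sketch of the fusion architecture: per-modality BLSTMs, then a shared BLSTM."""
    def __init__(self, d_audio, d_video, d_hid):
        self.audio = BLSTM(d_audio, d_hid)
        self.video = BLSTM(d_video, d_hid)
        self.fusion = BLSTM(4 * d_hid, d_hid)  # concat of two bidirectional outputs
        self.w_out = [random.uniform(-0.1, 0.1) for _ in range(2 * d_hid)]

    def forward(self, audio_frames, video_frames):
        ha = self.audio.forward(audio_frames)
        hv = self.video.forward(video_frames)
        fused = self.fusion.forward([a + v for a, v in zip(ha, hv)])
        # Frame-level speech/non-speech posterior (illustrative logistic head).
        return [sigmoid(sum(w * x for w, x in zip(self.w_out, h))) for h in fused]

# Toy usage: 5 synchronized frames with 8-dim audio and 6-dim video features.
T, Da, Dv = 5, 8, 6
audio = [[random.gauss(0, 1) for _ in range(Da)] for _ in range(T)]
video = [[random.gauss(0, 1) for _ in range(Dv)] for _ in range(T)]
model = BimodalRNN(Da, Dv, d_hid=4)
posteriors = model.forward(audio, video)
print(len(posteriors), all(0.0 < p < 1.0 for p in posteriors))  # prints: 5 True
```

Because both modality streams are modeled over time before fusion, and the fusion network is itself recurrent, the sketch reflects the paper's stated goal of capturing timing dependencies within each modality and across modalities.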
