Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems that rely on audio-only features are often impaired by environmental noise. An alternative approach to address this problem is audiovisual VAD (AV-VAD). Modeling the timing dependencies between acoustic and visual features is a key challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) that combines audiovisual features in a principled, unified framework, capturing the timing dependencies both within and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network, and the outputs of these networks serve as the input to a third BLSTM network that fuses the two modalities. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., a high-definition (HD) camera and a close-talk microphone), and by 18.5% (absolute) under more challenging conditions (i.e., the camera and microphone of a tablet, with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
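The fusion scheme described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the layer sizes, random initialization, and the frame-level logistic output head are illustrative assumptions; a real system would use a deep-learning framework and trained weights. The sketch only shows the data flow: one BLSTM per modality, whose per-frame outputs are concatenated and fed to a fusion BLSTM.

```python
import math
import random

random.seed(7)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(a, b):
    return [u + v for u, v in zip(a, b)]

class LSTM:
    """Minimal unidirectional LSTM with random (untrained) weights."""
    def __init__(self, d_in, d_hid):
        self.d_hid = d_hid
        rnd = lambda r, c: [[random.uniform(-0.1, 0.1) for _ in range(c)]
                            for _ in range(r)]
        # One input-to-hidden and hidden-to-hidden matrix per gate (i, f, c, o).
        self.Wx = {g: rnd(d_hid, d_in) for g in "ifco"}
        self.Wh = {g: rnd(d_hid, d_hid) for g in "ifco"}

    def forward(self, xs):
        h, c, out = [0.0] * self.d_hid, [0.0] * self.d_hid, []
        for x in xs:
            i = [sigmoid(v) for v in vadd(matvec(self.Wx["i"], x), matvec(self.Wh["i"], h))]
            f = [sigmoid(v) for v in vadd(matvec(self.Wx["f"], x), matvec(self.Wh["f"], h))]
            g = [math.tanh(v) for v in vadd(matvec(self.Wx["c"], x), matvec(self.Wh["c"], h))]
            o = [sigmoid(v) for v in vadd(matvec(self.Wx["o"], x), matvec(self.Wh["o"], h))]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            out.append(h)
        return out

class BLSTM:
    """Bidirectional wrapper: concatenate forward and backward hidden states."""
    def __init__(self, d_in, d_hid):
        self.fwd, self.bwd = LSTM(d_in, d_hid), LSTM(d_in, d_hid)

    def forward(self, xs):
        hf = self.fwd.forward(xs)
        hb = self.bwd.forward(list(reversed(xs)))[::-1]
        return [a + b for a, b in zip(hf, hb)]  # per-frame list concatenation

class BimodalRNN:
    """Sketch of the fusion architecture: per-modality BLSTMs, then a shared BLSTM."""
    def __init__(self, d_audio, d_video, d_hid):
        self.audio = BLSTM(d_audio, d_hid)
        self.video = BLSTM(d_video, d_hid)
        self.fusion = BLSTM(4 * d_hid, d_hid)  # concat of two bidirectional outputs
        self.w_out = [random.uniform(-0.1, 0.1) for _ in range(2 * d_hid)]

    def forward(self, audio_frames, video_frames):
        ha = self.audio.forward(audio_frames)
        hv = self.video.forward(video_frames)
        fused = self.fusion.forward([a + v for a, v in zip(ha, hv)])
        # Frame-level speech/non-speech posterior (illustrative logistic head).
        return [sigmoid(sum(w * x for w, x in zip(self.w_out, h))) for h in fused]

# Toy usage: 5 synchronized frames with 8-dim audio and 6-dim video features.
T, Da, Dv = 5, 8, 6
audio = [[random.gauss(0, 1) for _ in range(Da)] for _ in range(T)]
video = [[random.gauss(0, 1) for _ in range(Dv)] for _ in range(T)]
model = BimodalRNN(Da, Dv, d_hid=4)
posteriors = model.forward(audio, video)
print(len(posteriors), all(0.0 < p < 1.0 for p in posteriors))  # prints: 5 True
```

Because both modality streams are modeled over time before fusion, and the fusion network is itself recurrent, the sketch reflects the paper's stated goal of capturing timing dependencies within each modality and across modalities.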
