Paper information - Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild (WildVVAD), based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset (see https://team.inria.fr/perception/research/vvad/).

Radu Horaud | Pablo Mesejo | Stéphane Lathuilière | Sylvain Guy
