Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Visual voice activity detection (V-VAD) uses visual features to predict whether or not a person is speaking. V-VAD is useful whenever audio VAD (A-VAD) is ineffective, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, the available datasets used for training and testing V-VAD lack content variability. We introduce a novel methodology, based on combining A-VAD with face detection and tracking, to automatically create and annotate very large in-the-wild datasets, and we use it to build WildVVAD. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset. See https://team.inria.fr/perception/research/vvad/
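
The abstract does not spell out how A-VAD is combined with face detection and tracking, so the following is only a minimal sketch of one plausible labeling rule, not the authors' pipeline. It assumes that a segment is usable only when exactly one face track is present, and that the segment is labeled by the fraction of audio frames an A-VAD marks as speech. The `Segment` type, the `auto_label` function, and both thresholds are hypothetical names and values introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical container for one short video segment."""
    face_tracks: int           # number of face tracks found (assumed detector/tracker output)
    speech_frames: List[bool]  # per-frame A-VAD decisions (assumed A-VAD output)


def auto_label(segment: Segment,
               speech_thresh: float = 0.9,
               silence_thresh: float = 0.1) -> Optional[str]:
    """Return 'speaking', 'silent', or None when the segment is ambiguous.

    Assumption: with exactly one visible face, audio speech activity can be
    attributed to that face. The thresholds are illustrative, not from the paper.
    """
    # Discard segments with zero or several faces, or with no audio decisions.
    if segment.face_tracks != 1 or not segment.speech_frames:
        return None

    # Fraction of frames the A-VAD flagged as speech.
    ratio = sum(segment.speech_frames) / len(segment.speech_frames)

    if ratio >= speech_thresh:
        return "speaking"
    if ratio <= silence_thresh:
        return "silent"
    return None  # mixed speech and silence: too ambiguous to label automatically
```

Discarding ambiguous segments rather than guessing trades dataset size for label quality, which is a common choice when annotations are produced without human review.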
