Are You Speaking: Real-Time Speech Activity Detection via Landmark Pooling Network

In this paper, we propose a novel framework based on visual information to solve the real-time speech activity detection problem. Unlike conventional methods, which commonly use the audio signal as input, our approach feeds facial information into a deep neural network for feature learning. Instead of using the whole input image, we further develop a novel end-to-end landmark pooling network that acts as an attention-guidance scheme, helping the deep neural network focus only on the relevant portion of the input image. This allows the network to learn highly discriminative features for speech activity precisely and efficiently. In addition, we implement a recurrent neural network with gated recurrent units to exploit the sequential information in video when producing the final decision. To give a comprehensive evaluation of the proposed method, we collect a large-scale dataset of unconstrained speech activities, which consists of a large number of speech/non-speech video sequences under various kinds of degradation. Experimental results demonstrate the superiority of the proposed pipeline over previous approaches in terms of both performance and efficiency.
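The paper does not include an implementation, but the two core ideas, pooling convolutional features only around facial landmarks and aggregating per-frame descriptors with a GRU before a final speaking/not-speaking decision, can be sketched. The following is a minimal NumPy illustration under stated assumptions: the feature-map shape, the landmark coordinates, the pooling window, and all weight initializations are hypothetical placeholders, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def landmark_pool(feature_map, landmarks, radius=1):
    """Average-pool features in a small window around each landmark.

    feature_map: (H, W, C) conv features for one frame.
    landmarks:   (K, 2) integer (row, col) landmark positions.
    Returns a (K * C,) vector, one pooled descriptor per landmark,
    so the rest of the frame is ignored (the 'attention' effect).
    """
    H, W, C = feature_map.shape
    pooled = []
    for r, c in landmarks:
        r0, r1 = max(r - radius, 0), min(r + radius + 1, H)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, W)
        pooled.append(feature_map[r0:r1, c0:c1].mean(axis=(0, 1)))
    return np.concatenate(pooled)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (standard update/reset-gate formulation)."""
    def __init__(self, in_dim, hid_dim, rng):
        s = 1.0 / np.sqrt(hid_dim)
        self.Wz = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wr = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wh = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                              # update gate
        r = sigmoid(self.Wr @ xh)                              # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

# Toy sequence: T frames of an 8x8x4 feature map, 3 mouth-region landmarks.
T, H, W, C, K, HID = 5, 8, 8, 4, 3, 16
landmarks = np.array([[4, 2], [4, 4], [4, 6]])
cell = GRUCell(in_dim=K * C, hid_dim=HID, rng=rng)
w_out = rng.uniform(-0.1, 0.1, HID)

h = np.zeros(HID)
for _ in range(T):
    frame_features = rng.standard_normal((H, W, C))  # stand-in for CNN output
    h = cell.step(landmark_pool(frame_features, landmarks), h)

speaking_prob = sigmoid(w_out @ h)  # final speaking / not-speaking score
```

In a trained system the frame features would come from the convolutional backbone and the weights would be learned end to end; here the untrained weights only demonstrate the data flow from landmarks to a single sequence-level probability.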
