UniCon: Unified Context Network for Robust Active Speaker Detection

We propose a new, efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional ASD methods operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios such as low-resolution faces or multiple candidate speakers. Our solution is a novel, unified framework that jointly models multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among candidates and contrast their audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on this information, our model optimizes all candidates in a unified process for robust and reliable ASD. We perform a thorough ablation study on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state of the art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, the first result to surpass 90% on this challenging dataset at the time of submission. Project website: https://unicon-asd.github.io/.
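The abstract does not specify implementation details, so the following is a minimal, hypothetical PyTorch sketch of the three-context idea only: a spatial embedding of each face's box, cross-candidate attention for relational context, and a recurrent layer for temporal smoothing, with all candidates scored jointly. Every module choice, tensor shape, and name here (UniConSketch, feat_dim, the BiGRU, the attention head count) is an assumption for exposition, not the authors' architecture.

```python
import torch
import torch.nn as nn

class UniConSketch(nn.Module):
    """Illustrative sketch of fusing spatial, relational, and temporal
    context for N candidate face tracks over T frames. Assumed, not the
    paper's actual implementation."""

    def __init__(self, feat_dim=512, box_dim=4):
        super().__init__()
        # Spatial context: embed each face's normalized bounding box
        # (position and scale) into the feature space.
        self.spatial_embed = nn.Linear(box_dim, feat_dim)
        # Relational context: candidates attend to each other so the model
        # can contrast audio-visual affinities across speakers.
        self.relational = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                batch_first=True)
        # Temporal context: a BiGRU (one plausible choice) aggregates
        # long-term information and smooths local uncertainties.
        self.temporal = nn.GRU(feat_dim, feat_dim // 2,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, face_feats, audio_feats, boxes):
        # face_feats, audio_feats: (B, N, T, D) -- audio assumed pre-tiled
        # per candidate; boxes: (B, N, T, 4) as normalized (x, y, w, h).
        x = face_feats + audio_feats + self.spatial_embed(boxes)
        B, N, T, D = x.shape
        # Relational attention across the N candidates at each time step.
        r = x.permute(0, 2, 1, 3).reshape(B * T, N, D)
        r, _ = self.relational(r, r, r)
        x = r.reshape(B, T, N, D).permute(0, 2, 1, 3)
        # Temporal smoothing along T for each candidate track.
        t = x.reshape(B * N, T, D)
        t, _ = self.temporal(t)
        # One speaking score per candidate per frame, produced for all
        # candidates in a single unified forward pass.
        return self.classifier(t).reshape(B, N, T)
```

Run on dummy inputs, `UniConSketch()(torch.randn(2, 3, 16, 512), torch.randn(2, 3, 16, 512), torch.rand(2, 3, 16, 4))` returns a (2, 3, 16) score map: one logit per candidate per frame, consistent with the abstract's description of optimizing all candidates jointly rather than scoring each cropped face track in isolation.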
