UniCon: Unified Context Network for Robust Active Speaker Detection

We propose a new, efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional ASD methods operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios such as low-resolution faces or multiple candidate speakers. Our solution is a novel, unified framework that jointly models multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among candidates and contrast their audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on this information, our model optimizes all candidates in a unified process for robust and reliable ASD. We perform a thorough ablation study on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state of the art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, the first result to surpass 90% on this challenging dataset at the time of submission. Project website: https://unicon-asd.github.io/.
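The abstract does not specify implementation details, so the following is a minimal, hypothetical PyTorch sketch of the three-context idea only: a spatial embedding of each face's box, cross-candidate attention for relational context, and a recurrent layer for temporal smoothing, with all candidates scored jointly. Every module choice, tensor shape, and name here (UniConSketch, feat_dim, the BiGRU, the attention head count) is an assumption for exposition, not the authors' architecture.

```python
import torch
import torch.nn as nn

class UniConSketch(nn.Module):
    """Illustrative sketch of fusing spatial, relational, and temporal
    context for N candidate face tracks over T frames. Assumed, not the
    paper's actual implementation."""

    def __init__(self, feat_dim=512, box_dim=4):
        super().__init__()
        # Spatial context: embed each face's normalized bounding box
        # (position and scale) into the feature space.
        self.spatial_embed = nn.Linear(box_dim, feat_dim)
        # Relational context: candidates attend to each other so the model
        # can contrast audio-visual affinities across speakers.
        self.relational = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                batch_first=True)
        # Temporal context: a BiGRU (one plausible choice) aggregates
        # long-term information and smooths local uncertainties.
        self.temporal = nn.GRU(feat_dim, feat_dim // 2,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, face_feats, audio_feats, boxes):
        # face_feats, audio_feats: (B, N, T, D) -- audio assumed pre-tiled
        # per candidate; boxes: (B, N, T, 4) as normalized (x, y, w, h).
        x = face_feats + audio_feats + self.spatial_embed(boxes)
        B, N, T, D = x.shape
        # Relational attention across the N candidates at each time step.
        r = x.permute(0, 2, 1, 3).reshape(B * T, N, D)
        r, _ = self.relational(r, r, r)
        x = r.reshape(B, T, N, D).permute(0, 2, 1, 3)
        # Temporal smoothing along T for each candidate track.
        t = x.reshape(B * N, T, D)
        t, _ = self.temporal(t)
        # One speaking score per candidate per frame, produced for all
        # candidates in a single unified forward pass.
        return self.classifier(t).reshape(B, N, T)
```

Run on dummy inputs, `UniConSketch()(torch.randn(2, 3, 16, 512), torch.randn(2, 3, 16, 512), torch.rand(2, 3, 16, 4))` returns a (2, 3, 16) score map: one logit per candidate per frame, consistent with the abstract's description of optimizing all candidates jointly rather than scoring each cropped face track in isolation.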
