Multi-Task Learning for Audio-Visual Active Speaker Detection

This report describes the approach underlying our submission to the active speaker detection task (task B-2) of the ActivityNet Challenge 2019. We introduce a new audio-visual model that builds upon a 3D-ResNet18 visual model pretrained for lipreading and a VGG-M acoustic model pretrained for audio-to-video synchronization. The model is trained with two losses in a multi-task learning fashion: a contrastive loss that enforces matching between audio and video features for active speakers, and a standard cross-entropy loss that produces speaker/non-speaker labels. This model obtains 84.0% mAP on the validation set of AVA-ActiveSpeaker. Experimental results demonstrate the ability of the pretrained embeddings to transfer across tasks and data formats, as well as the advantage of the proposed multi-task learning strategy.
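
To make the two-loss setup concrete, below is a minimal PyTorch-style sketch of how a contrastive loss on the audio and video embeddings could be combined with a cross-entropy classification loss. The exact loss formulation and weighting used in the submission are not specified here; `margin` and `alpha`, as well as the function and argument names, are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F


def multitask_loss(audio_emb, video_emb, logits, labels, margin=1.0, alpha=0.5):
    """Hypothetical sketch of a two-loss, multi-task objective.

    audio_emb, video_emb: (batch, dim) embeddings from the acoustic and visual encoders
    logits:               (batch, 2) speaker / non-speaker scores from a classifier head
    labels:               (batch,) long tensor, 1 = active speaker, 0 = not speaking
    margin, alpha:        illustrative hyperparameters, not from the original paper
    """
    # Contrastive term: pull audio and video embeddings together for active
    # speakers, push them at least `margin` apart for non-speakers.
    dist = F.pairwise_distance(audio_emb, video_emb)
    pos = labels.float() * dist.pow(2)
    neg = (1.0 - labels.float()) * F.relu(margin - dist).pow(2)
    contrastive = (pos + neg).mean()

    # Cross-entropy term on the speaker / non-speaker classification head.
    ce = F.cross_entropy(logits, labels)

    # Weighted sum of the two task losses (multi-task learning).
    return alpha * contrastive + (1.0 - alpha) * ce
```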
