Multi-Task Learning for Audio-Visual Active Speaker Detection

This report describes the approach underlying our submission to the active speaker detection task (task B-2) of the ActivityNet Challenge 2019. We introduce a new audio-visual model that builds upon a 3D-ResNet18 visual model pretrained for lipreading and a VGG-M acoustic model pretrained for audio-to-video synchronization. The model is trained with two losses in a multi-task learning fashion: a contrastive loss that enforces matching between audio and video features for active speakers, and a standard cross-entropy loss that produces speaker/non-speaker labels. This model obtains 84.0% mAP on the validation set of AVA-ActiveSpeaker. Experimental results demonstrate the ability of the pretrained embeddings to transfer across tasks and data formats, as well as the advantage of the proposed multi-task learning strategy.
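
To make the two-loss setup concrete, below is a minimal PyTorch-style sketch of how a contrastive loss on the audio and video embeddings could be combined with a cross-entropy classification loss. The exact loss formulation and weighting used in the submission are not specified here; `margin` and `alpha`, as well as the function and argument names, are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F


def multitask_loss(audio_emb, video_emb, logits, labels, margin=1.0, alpha=0.5):
    """Hypothetical sketch of a two-loss, multi-task objective.

    audio_emb, video_emb: (batch, dim) embeddings from the acoustic and visual encoders
    logits:               (batch, 2) speaker / non-speaker scores from a classifier head
    labels:               (batch,) long tensor, 1 = active speaker, 0 = not speaking
    margin, alpha:        illustrative hyperparameters, not from the original paper
    """
    # Contrastive term: pull audio and video embeddings together for active
    # speakers, push them at least `margin` apart for non-speakers.
    dist = F.pairwise_distance(audio_emb, video_emb)
    pos = labels.float() * dist.pow(2)
    neg = (1.0 - labels.float()) * F.relu(margin - dist).pow(2)
    contrastive = (pos + neg).mean()

    # Cross-entropy term on the speaker / non-speaker classification head.
    ce = F.cross_entropy(logits, labels)

    # Weighted sum of the two task losses (multi-task learning).
    return alpha * contrastive + (1.0 - alpha) * ce
```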
