Naver at ActivityNet Challenge 2019 - Task B Active Speaker Detection (AVA)

This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN)-based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.
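
Although the report itself gives no implementation details, the architecture it names (a 3D CNN front-end feeding an ensemble of temporal convolution and LSTM classifiers) can be sketched in PyTorch. All module names, tensor shapes, kernel sizes, and the probability-averaging ensemble below are illustrative assumptions, not the authors' actual configuration:

```python
# Minimal sketch of the described architecture. Shapes and hyperparameters
# are assumptions for illustration only.
import torch
import torch.nn as nn


class Conv3DFrontEnd(nn.Module):
    """Hypothetical 3D-CNN front-end: video clip -> per-frame embeddings."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        feats = self.conv(clip)                   # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(feats)                   # (B, T, embed_dim)


class TemporalConvHead(nn.Module):
    """1D temporal convolutions over the embedding sequence."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(embed_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, x):                         # x: (B, T, D)
        return self.net(x.transpose(1, 2)).squeeze(1)  # (B, T) logits


class LSTMHead(nn.Module):
    """Bidirectional LSTM over the embedding sequence."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, 128, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(256, 1)

    def forward(self, x):                         # x: (B, T, D)
        out, _ = self.lstm(x)                     # (B, T, 256)
        return self.fc(out).squeeze(-1)           # (B, T) logits


# Ensemble by averaging per-frame speaking probabilities from both heads.
front_end = Conv3DFrontEnd()
heads = [TemporalConvHead(), LSTMHead()]
clip = torch.randn(2, 3, 16, 112, 112)  # dummy batch: 16 frames of 112x112
emb = front_end(clip)
probs = torch.stack([torch.sigmoid(h(emb)) for h in heads]).mean(0)  # (B, T)
```

Averaging sigmoid outputs is one simple ensembling scheme; the report does not specify how the two classifiers' predictions are actually combined.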
