End-to-End Lip Synchronisation

The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed the similarity between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the target task. To this end, we propose an end-to-end trained network that directly predicts the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features; inferring the offset can then be cast as a pattern recognition problem in which the matrix is treated as an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms previous work by a large margin on the LRS2 and LRS3 datasets.
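A minimal sketch of the pipeline the abstract describes, written in PyTorch. The backbone architectures, feature dimensions, input representations, and offset range are not specified in the abstract, so everything below (the linear encoders, the 40-d audio frames, the ±15-frame offset window, the small CNN classifier) is an illustrative assumption; only the overall structure — cross-modal similarity matrix fed to a classifier, trained end-to-end with the feature extractors — follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EndToEndSyncNet(nn.Module):
    """Sketch: per-modality feature extractors plus a classifier that
    reads the audio-video similarity matrix as an image and predicts a
    discrete offset class. All layer choices here are placeholders."""

    def __init__(self, feat_dim=512, max_offset=15):
        super().__init__()
        # Placeholder per-frame encoders (the paper's actual backbones
        # are not given in the abstract).
        self.audio_net = nn.Linear(40, feat_dim)    # e.g. 40-d audio frames
        self.video_net = nn.Linear(4096, feat_dim)  # e.g. flattened mouth crops
        num_classes = 2 * max_offset + 1            # offsets in [-15, +15]
        # Small CNN treating the similarity matrix as a 1-channel image.
        self.classifier = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, num_classes),
        )

    def forward(self, audio, video):
        # audio: (B, Ta, 40), video: (B, Tv, 4096)
        a = F.normalize(self.audio_net(audio), dim=-1)  # (B, Ta, D)
        v = F.normalize(self.video_net(video), dim=-1)  # (B, Tv, D)
        # Cosine-similarity matrix over all audio/video frame pairs.
        sim = torch.bmm(a, v.transpose(1, 2)).unsqueeze(1)  # (B, 1, Ta, Tv)
        return self.classifier(sim)  # logits over offset classes

# Training step: cross-entropy over offset classes. Gradients flow into
# both the classifier and the feature extractors, i.e. end-to-end.
model = EndToEndSyncNet()
audio = torch.randn(4, 100, 40)
video = torch.randn(4, 25, 4096)
target = torch.randint(0, 31, (4,))  # ground-truth offset class
loss = F.cross_entropy(model(audio, video), target)
loss.backward()
```

The key design point carried over from the abstract is that offset prediction is a single classification over the similarity matrix, rather than a sliding-window comparison of independently trained embeddings.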
