SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, whereas publicly available transcribed video datasets remain limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is a speech-driven lip animation model that generates lip movements conditioned on the input speech. The lip animation model is trained on an unlabeled audio-visual dataset and can be further optimized towards a pre-trained VSR model when labeled videos are available. Since transcribed acoustic data and face images are plentiful, we can generate large-scale synthetic data with the proposed lip animation model for semi-supervised VSR training. We evaluate our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3). SynthVSR achieves a word error rate (WER) of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches that use thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.
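To make the described pipeline concrete, below is a minimal Python sketch of the data-generation loop the abstract outlines: a speech-driven lip animation model turns pairs of transcribed audio and face images into synthetic labeled video for semi-supervised VSR training. All names here (SpeechDrivenAnimator, generate_synthetic_corpus, the frame-rate constants) are hypothetical placeholders for illustration; the paper does not publish this API, and the animator is stubbed out so the sketch is self-contained.

```python
"""Hypothetical sketch of the SynthVSR synthetic-data pipeline (not the authors' code)."""

import numpy as np


class SpeechDrivenAnimator:
    """Stub for the speech-driven lip animation model.

    In the paper, this model is trained on unlabeled audio-visual data and can be
    further optimized against a pre-trained VSR model; here it returns dummy
    frames so the example runs end to end.
    """

    def animate(self, face_image: np.ndarray, audio: np.ndarray) -> np.ndarray:
        # Assume 25 fps video and 16 kHz audio, i.e. 640 audio samples per frame.
        num_frames = max(1, len(audio) // 640)
        # A real model would generate lip movements conditioned on the speech;
        # the stub simply tiles the input face image across time (T, H, W, C).
        return np.repeat(face_image[None], num_frames, axis=0)


def generate_synthetic_corpus(animator, faces, transcribed_audio):
    """Pair each (audio, transcript) clip with a face to build labeled video data."""
    corpus = []
    for audio, transcript in transcribed_audio:
        face = faces[np.random.randint(len(faces))]
        video = animator.animate(face, audio)  # synthetic lip-movement frames
        corpus.append((video, transcript))     # labeled sample for VSR training
    return corpus


if __name__ == "__main__":
    animator = SpeechDrivenAnimator()
    faces = [np.zeros((96, 96, 3), dtype=np.uint8)]                 # stand-in face crops
    audio_data = [(np.zeros(16000, dtype=np.float32), "hello world")]  # 1 s of audio
    synthetic = generate_synthetic_corpus(animator, faces, audio_data)
    # The synthetic clips would then be mixed with real labeled video
    # (e.g. LRS3) for semi-supervised VSR training.
    print(len(synthetic), synthetic[0][0].shape, synthetic[0][1])
```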
