More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations such as natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show that our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks, including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented on the project page.
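To make the core idea concrete, the sketch below is a toy illustration (not the paper's implementation) of conditioning speech generation on video frames in addition to text: per-frame video features are fused with text-derived features to give the decoder one conditioning vector per video frame, which is what allows pauses and timing to follow the video. All shapes, function names, and the simple concatenation-based fusion are assumptions chosen for clarity.

```python
# Toy sketch of visually-driven conditioning for TTS. Everything here
# (encoders, shapes, fusion by concatenation) is an illustrative assumption,
# not the VDTTS architecture.
import numpy as np

def encode_text(phoneme_ids: np.ndarray, dim: int = 16) -> np.ndarray:
    """Toy text encoder: a fixed random embedding per phoneme id."""
    rng = np.random.default_rng(0)
    table = rng.normal(size=(256, dim))
    return table[phoneme_ids]                      # (num_phonemes, dim)

def encode_video(frames: np.ndarray, dim: int = 16) -> np.ndarray:
    """Toy video encoder: mean-pool pixels per frame, project to `dim`."""
    rng = np.random.default_rng(1)
    proj = rng.normal(size=(frames.shape[-1], dim))
    pooled = frames.mean(axis=(1, 2))              # (num_frames, channels)
    return pooled @ proj                           # (num_frames, dim)

def fuse(text_enc: np.ndarray, video_enc: np.ndarray) -> np.ndarray:
    """Align text features to the video frame rate and concatenate, so a
    downstream spectrogram decoder sees both what to say and, per frame,
    when to say it."""
    idx = np.linspace(0, len(text_enc) - 1, num=len(video_enc)).round().astype(int)
    return np.concatenate([text_enc[idx], video_enc], axis=-1)  # (num_frames, 2*dim)

phonemes = np.array([3, 17, 42, 7])                # dummy phoneme sequence
video = np.zeros((75, 48, 48, 3))                  # 3 s of 25 fps face crops
conditioning = fuse(encode_text(phonemes), encode_video(video))
print(conditioning.shape)                          # (75, 32): one vector per video frame
```

In practice the decoder that consumes such per-frame conditioning would be a neural spectrogram generator followed by a vocoder; the point of the sketch is only that the conditioning carries the video's timing, which plain text-only TTS lacks.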
