ASR is All You Need: Cross-Modal Distillation for Lip Reading

The goal of this work is to train strong models for visual speech recognition without requiring human-annotated ground-truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground-truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets when training only on publicly available data.
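To make the combined objective concrete, the following is a minimal sketch in PyTorch, not the authors' released code. It assumes the ASR teacher and the lip-reading student emit per-frame posteriors over the same character vocabulary at the same frame rate, that the CTC targets are the teacher's own transcriptions rather than ground truth, and that the names (distillation_loss, ce_weight, etc.) are illustrative only.

import torch.nn.functional as F


def distillation_loss(student_logits,     # (T, B, C) raw scores from the lip-reading student
                      teacher_log_probs,  # (T, B, C) log-posteriors from the ASR teacher
                      pseudo_targets,     # concatenated label ids decoded by the teacher
                      input_lengths,      # (B,) number of valid frames per utterance
                      target_lengths,     # (B,) length of each pseudo-transcription
                      blank_id=0,
                      ce_weight=1.0):     # hypothetical weight balancing the two terms
    """Combined CTC + frame-wise cross-entropy distillation loss (sketch)."""
    log_probs = F.log_softmax(student_logits, dim=-1)

    # CTC loss against the teacher's transcriptions, used as pseudo ground truth.
    ctc = F.ctc_loss(log_probs, pseudo_targets, input_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)

    # Frame-wise term: KL(teacher || student), i.e. the cross-entropy between the
    # teacher's and the student's per-frame posteriors up to an additive constant.
    frame_kl = F.kl_div(log_probs, teacher_log_probs, log_target=True,
                        reduction="none").sum(-1).mean()

    return ctc + ce_weight * frame_kl

In this sketch both teacher_log_probs and pseudo_targets come from the audio-only ASR model, so no manual transcription is required; the relative weighting of the two terms is an assumed hyper-parameter, not a value taken from the paper.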
