Learning to lip read words by watching videos

Abstract: Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing work in this area has focused on recognising a small number of utterances in controlled environments (e.g. digits and letters of the alphabet), partly due to the shortage of suitable datasets. We make three novel contributions: first, we develop a pipeline for fully automated data collection from TV broadcasts, with which we have generated a dataset of over a million word instances spoken by over a thousand different people; second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data, and we apply this network to the tasks of audio-to-video synchronisation and active speaker detection; third, we train convolutional and recurrent networks that effectively learn and recognise hundreds of words from this large-scale dataset. In both lip reading and speaker detection, we demonstrate results that exceed the current state of the art on public benchmark datasets.
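For concreteness, the sketch below illustrates the kind of two-stream embedding the abstract describes: one stream over a short stack of greyscale mouth-region frames, one over the corresponding MFCC window, trained with a contrastive loss so that synchronised audio-video pairs lie close in the joint space and temporally shifted pairs lie far apart. This is a minimal illustration, not the paper's exact architecture; all layer sizes, input shapes, and names (TwoStreamEmbedding, contrastive_loss) are assumptions, and the 5-frame / 0.2 s pairing assumes 25 fps video with MFCCs sampled at 100 Hz.

```python
# Minimal PyTorch sketch of a two-stream audio-visual embedding network
# (illustrative only; layer sizes and names are assumptions, not the
# architecture from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamEmbedding(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: 5 stacked greyscale mouth frames as input channels.
        self.visual = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio stream: 2-D convolutions over a short MFCC "image"
        # (coefficients x time).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, frames, mfcc):
        # L2-normalise so that Euclidean distance in the joint space
        # measures audio-video agreement.
        v = F.normalize(self.visual(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, label, margin=1.0):
    # label = 1 for synchronised pairs, 0 for shifted (negative) pairs:
    # pull positives together, push negatives beyond the margin.
    d = F.pairwise_distance(v, a)
    return (label * d.pow(2) +
            (1 - label) * F.relu(margin - d).pow(2)).mean()

# Usage with dummy data: a batch of 5-frame mouth crops and the matching
# 13-coefficient x 20-step MFCC windows, all treated as positive pairs.
net = TwoStreamEmbedding()
frames = torch.randn(8, 5, 112, 112)
mfcc = torch.randn(8, 1, 13, 20)
v, a = net(frames, mfcc)
loss = contrastive_loss(v, a, torch.ones(8))
```

Because the supervision signal is only "synchronised or not", which can be generated automatically by shifting the audio track, such a network can be trained from unlabelled video; at test time, sliding the audio window and taking the distance minimum gives the synchronisation offset, and a large minimum distance indicates the face on screen is not the speaker.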
