Learning to lip read words by watching videos

Abstract: Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing work in this area has focused on recognising a small number of utterances in controlled environments (e.g. digits and letters of the alphabet), partly due to the shortage of suitable datasets. We make three novel contributions: first, we develop a pipeline for fully automated data collection from TV broadcasts, with which we have generated a dataset of over a million word instances spoken by over a thousand different people; second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data, and we apply this network to the tasks of audio-to-video synchronisation and active speaker detection; third, we train convolutional and recurrent networks that effectively learn and recognise hundreds of words from this large-scale dataset. In both lip reading and speaker detection, we demonstrate results that exceed the current state of the art on public benchmark datasets.
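For concreteness, the sketch below illustrates the kind of two-stream embedding the abstract describes: one stream over a short stack of greyscale mouth-region frames, one over the corresponding MFCC window, trained with a contrastive loss so that synchronised audio-video pairs lie close in the joint space and temporally shifted pairs lie far apart. This is a minimal illustration, not the paper's exact architecture; all layer sizes, input shapes, and names (TwoStreamEmbedding, contrastive_loss) are assumptions, and the 5-frame / 0.2 s pairing assumes 25 fps video with MFCCs sampled at 100 Hz.

```python
# Minimal PyTorch sketch of a two-stream audio-visual embedding network
# (illustrative only; layer sizes and names are assumptions, not the
# architecture from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamEmbedding(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: 5 stacked greyscale mouth frames as input channels.
        self.visual = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio stream: 2-D convolutions over a short MFCC "image"
        # (coefficients x time).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, frames, mfcc):
        # L2-normalise so that Euclidean distance in the joint space
        # measures audio-video agreement.
        v = F.normalize(self.visual(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, label, margin=1.0):
    # label = 1 for synchronised pairs, 0 for shifted (negative) pairs:
    # pull positives together, push negatives beyond the margin.
    d = F.pairwise_distance(v, a)
    return (label * d.pow(2) +
            (1 - label) * F.relu(margin - d).pow(2)).mean()

# Usage with dummy data: a batch of 5-frame mouth crops and the matching
# 13-coefficient x 20-step MFCC windows, all treated as positive pairs.
net = TwoStreamEmbedding()
frames = torch.randn(8, 5, 112, 112)
mfcc = torch.randn(8, 1, 13, 20)
v, a = net(frames, mfcc)
loss = contrastive_loss(v, a, torch.ones(8))
```

Because the supervision signal is only "synchronised or not", which can be generated automatically by shifting the audio track, such a network can be trained from unlabelled video; at test time, sliding the audio window and taking the distance minimum gives the synchronisation offset, and a large minimum distance indicates the face on screen is not the speaker.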
