Visual recognition of human communication

The objective of this work is the visual recognition of speech and gestures. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it advances the state of the art in speech recognition by enabling machines to exploit the multi-modal nature of human communication. Visual recognition of speech and gestures is nevertheless a challenging problem, in part due to the lack of annotations and datasets, but also due to inter- and intra-personal variation and, in the case of visual speech, the ambiguities arising from homophones.

Training deep learning models requires large amounts of data. We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcript. Building such a dataset requires knowing 'who' is speaking 'when'. We therefore develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabelled data, and apply this network to the tasks of audio-to-video synchronisation and active speaker detection. Not only does this play a crucial role in building the dataset that forms the basis of much of the research in this thesis, but the method also learns powerful representations of the visual and auditory inputs that can be used for related tasks such as lip reading. We further show that the methods developed here extend to the problem of generating talking faces from audio and still images.

We then propose a number of deep learning models that recognise visual speech at the word and sentence level. In both scenarios we demonstrate recognition performance that exceeds the state of the art on public datasets; in the latter case, the lip reading performance beats a professional lip reader on videos from BBC television. We also demonstrate that when audio is available, visual information helps to improve speech recognition performance.

Next, we present a method to recognise and localise short temporal signals in image time series where strong supervision is not available for training. We propose image encodings and ConvNet-based architectures that first recognise the signal and then localise it using back-propagation. The method is demonstrated for localising spoken words in audio and signed gestures in British Sign Language (BSL) videos.

Finally, we explore the problem of speaker recognition. Whereas previous work on speaker identification has been limited to constrained conditions, here we build a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline. We propose a number of ConvNet architectures that outperform traditional baselines on this dataset.
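The audio-visual synchronisation model described above can be illustrated with a short sketch: a two-stream network maps mouth-region video and the corresponding audio into a shared embedding space, trained with a contrastive loss on in-sync and shifted pairs. This is a minimal sketch of the idea only; the layer sizes, input shapes and margin below are illustrative assumptions, not the configuration used in the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AVSyncNet(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: a short clip of grey-scale mouth crops,
        # shape (B, 1, T=5, H=112, W=112); 3D convs pool space-time.
        self.visual = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio stream: an MFCC map covering the same time window,
        # shape (B, 1, n_mfcc=13, frames=20), treated as an image.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames, mfcc):
        # L2-normalised embeddings, so pair distances are comparable.
        v = F.normalize(self.visual(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, label, margin=0.5):
    # label = 1 for in-sync (positive) pairs, 0 for shifted (negative) pairs.
    d = F.pairwise_distance(v, a)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

At test time, the distance between the two embeddings can be evaluated over a range of candidate audio-video offsets; the offset with the smallest distance gives the synchronisation estimate, and a consistently large distance at every offset indicates that the visible face is not the speaker.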

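The back-propagation-based localisation can likewise be sketched: given any differentiable classifier trained to recognise a signal in a time series, the gradient of the class score with respect to the input yields a per-frame saliency from which the signal can be localised. The toy 1D ConvNet and feature shapes below are assumptions for illustration; the thesis works with image encodings of the time series rather than this toy model.

import torch
import torch.nn as nn

class ToyTemporalClassifier(nn.Module):
    def __init__(self, feat_dim=40, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
            nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):  # x: (B, feat_dim, T)
        return self.net(x)

def localise(model, seq, target_class):
    """Per-frame saliency (T,) for one sequence seq of shape (feat_dim, T)."""
    model.eval()
    x = seq.unsqueeze(0).requires_grad_(True)  # (1, D, T)
    score = model(x)[0, target_class]          # unnormalised class score
    score.backward()                           # gradients w.r.t. the input
    # Max of |gradient| over the feature axis gives one value per frame.
    return x.grad.abs().amax(dim=1).squeeze(0)  # (T,)

# Usage: the peak of the saliency marks where the recognised signal occurs.
model = ToyTemporalClassifier()
seq = torch.randn(40, 100)                     # 100 frames of 40-dim features
saliency = localise(model, seq, target_class=3)
print("signal localised at frame", int(saliency.argmax()))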