Lipper: Synthesizing Thy Speech using Multi-View Lipreading

Lipreading has many potential applications, for example in surveillance and video conferencing. Despite this, most work on lipreading systems has been limited to classifying silent videos into classes representing text phrases. However, framing lipreading as a text-based classification task raises several problems, including dependence on a particular language and on a fixed vocabulary mapping. In this paper, we therefore propose Lipper, a multi-view lipreading-to-audio system that models the problem as a regression task: the model takes silent videos as input and produces speech as output. With multi-view silent videos, we observe an improvement over single-view speech reconstruction, which we demonstrate through an exhaustive set of experiments in speaker-dependent, out-of-vocabulary, and speaker-independent settings. We further compare Lipper's delay with that of other speechreading systems to show the real-time nature of the audio produced, and we conduct a user study to assess the comprehensibility of the audio that Lipper generates.
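The regression framing described above can be illustrated with a minimal sketch. This is not the authors' architecture: it simply shows, with hypothetical numpy data, what it means to map concatenated multi-view visual features to continuous audio features (rather than to discrete text classes) via least-squares regression. All dimensions and feature names below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not Lipper's actual model): multi-view speech
# reconstruction framed as regression. Per-frame visual features from
# each camera view are concatenated and mapped to continuous audio
# features with a linear least-squares model.

rng = np.random.default_rng(0)

n_frames, feat_dim, n_views, audio_dim = 200, 32, 3, 25  # assumed sizes

# Hypothetical visual features: one matrix per view, one row per frame.
views = [rng.normal(size=(n_frames, feat_dim)) for _ in range(n_views)]
X = np.concatenate(views, axis=1)  # shape: (n_frames, n_views * feat_dim)

# Hypothetical per-frame target audio features (stand-ins for real
# acoustic parameters that a vocoder would turn back into speech).
Y = rng.normal(size=(n_frames, audio_dim))

# Regression from concatenated multi-view features to audio features.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ W

# Mean-squared reconstruction error over the training frames.
mse = float(np.mean((Y - Y_hat) ** 2))
```

The key contrast with classification-based lipreading is that the output `Y_hat` is a continuous acoustic representation per frame, so the approach is not tied to any language-specific vocabulary.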
