A Compact Representation of Visual Speech Data Using Latent Variables

The problem of visual speech recognition involves the decoding of the video dynamics of a talking mouth in a high-dimensional visual space. In this paper, we propose a generative latent variable model to provide a compact representation of visual speech data. The model uses latent variables to separately represent the inter-speaker variations of visual appearances and those caused by uttering, and incorporates the structural information of the observed visual data within an utterance through modelling the structure using a path graph and placing variables' priors along its embedded curve.

[1]  Kevin P. Murphy,et al.  Dynamic Bayesian Networks for Audio-Visual Speech Recognition , 2002, EURASIP J. Adv. Signal Process..

[2]  Yong Liu,et al.  Latent Gaussian Mixture Regression for Human Pose Estimation , 2010, ACCV.

[3]  Matti Pietikäinen,et al.  Towards a practical lipreading system , 2011, CVPR 2011.

[4]  Jeff A. Bilmes,et al.  DBN based multi-stream models for audio-visual speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[6]  H. McGurk,et al.  Hearing lips and seeing voices , 1976, Nature.

[7]  Chalapathy Neti,et al.  Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.

[8]  H. Damasio,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence: Special Issue on Perceptual Organization in Computer Vision , 1998 .

[9]  Cristian Sminchisescu,et al.  Spectral Latent Variable Models for Perceptual Inference , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[10]  Barry-John Theobald,et al.  Comparing visual features for lipreading , 2009, AVSP.

[11]  Li Lee,et al.  A frequency warping approach to speaker normalization , 1998, IEEE Trans. Speech Audio Process..

[12]  Giridharan Iyengar,et al.  A Cascade Visual Front End for Speaker Independent Automatic Speechreading , 2001, Int. J. Speech Technol..

[13]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[14]  Barry-John Theobald,et al.  Improving visual features for lip-reading , 2010, AVSP.

[15]  Timothy F. Cootes,et al.  Extraction of Visual Features for Lipreading , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Trevor Darrell,et al.  Multistream Articulatory Feature-Based Models for Visual Speech Recognition , 2009, IEEE Trans. Pattern Anal. Mach. Intell..

[17]  Hyeonjoon Moon,et al.  The FERET evaluation methodology for face-recognition algorithms , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Matti Pietikäinen,et al.  Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[20]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[21]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Yochai Konig,et al.  "Eigenlips" for robust speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Umar Mohammed,et al.  Probabilistic Models for Inference about Identity , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Juergen Luettin,et al.  Audio-Visual Speech Modeling for Continuous Speech Recognition , 2000, IEEE Trans. Multim..

[25]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[26]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[27]  Matti Pietikäinen,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON MULTIMEDIA 1 Lipreading with Local Spatiotemporal Descriptors , 2022 .