MKPLS: Manifold Kernel Partial Least Squares for Lipreading and Speaker Identification

Visual speech recognition is a challenging problem because many visual speech features are easily confused with one another. Speaker identification is usually coupled with speech recognition, and it matters in its own right for applications such as automatic access control, biometrics, authentication, and personal privacy. In this paper, we propose a novel approach for lipreading and speaker identification based on manifold parameterization in a low-dimensional latent space, where each manifold is represented as a point in that space. We first parameterize each instance manifold using a nonlinear mapping from a unified manifold representation. We then factorize the parameter space using Kernel Partial Least Squares (KPLS) to obtain a low-dimensional manifold latent space. We use two-way projections to obtain two manifold latent spaces, one for the speech content and one for the speaker. We apply our approach to two public databases: AVLetters and OuluVS. We report results for three lipreading settings: speaker independent, speaker dependent, and speaker semi-dependent. Our approach outperforms the baseline by at least 15% in the speaker semi-dependent setting and is competitive in the other two settings.
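The pipeline described above can be illustrated with a minimal sketch on synthetic data. It is not the authors' implementation. The following choices are assumptions made only for illustration: the unified manifold is taken to be the unit interval, the nonlinear mapping is a ridge-regularized Gaussian RBF fit, the kernel is Gaussian with a median-heuristic bandwidth, and KPLS is the standard NIPALS-style iteration with deflation. All function names and parameter values are hypothetical.

```python
import numpy as np

def rbf_features(t, centers, width):
    """Gaussian RBF responses of positions t in [0, 1] to fixed centers."""
    return np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

def parameterize_sequence(frames, centers, width=0.15, reg=1e-6):
    """Fit frames ~= Phi(t) @ C by ridge regression and return C flattened.

    Assumption: frames are placed uniformly on the unit interval, which
    stands in for the unified manifold representation.
    """
    t = np.linspace(0.0, 1.0, frames.shape[0])
    phi = rbf_features(t, centers, width)
    gram = phi.T @ phi + reg * np.eye(centers.size)
    return np.linalg.solve(gram, phi.T @ frames).ravel()

def gaussian_kernel(X):
    """Gaussian kernel between rows of X; bandwidth from the median heuristic."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    gamma = 1.0 / np.median(d2[np.triu_indices(X.shape[0], k=1)])
    return np.exp(-gamma * d2)

def kernel_pls_scores(K, Y, n_components, n_iter=200, tol=1e-10):
    """Training-set latent scores from NIPALS-style kernel PLS."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ K @ H                      # center the kernel in feature space
    Y = Y - Y.mean(axis=0)             # center the label indicators
    scores = []
    for _ in range(n_components):
        u = Y[:, np.argmax(np.linalg.norm(Y, axis=0))].copy()
        t = np.zeros(n)
        for _step in range(n_iter):
            t_new = K @ u
            t_new /= np.linalg.norm(t_new)
            u = Y @ (Y.T @ t_new)
            u /= np.linalg.norm(u)
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:
                break
        P = np.eye(n) - np.outer(t, t)
        K = P @ K @ P                  # deflate the kernel
        Y = P @ Y                      # deflate the labels
        scores.append(t)
    return np.column_stack(scores)

# Synthetic stand-in data: 3 "words" x 3 "speakers", 4 features per frame.
rng = np.random.default_rng(0)
centers = np.linspace(0.0, 1.0, 6)
words, speakers, params = [], [], []
for w in range(3):
    for s in range(3):
        t = np.linspace(0.0, 1.0, 20 + 3 * s)   # sequence length varies
        frames = np.stack(
            [np.sin(2 * np.pi * (w + 1) * t + k) for k in range(4)], axis=1
        )
        frames = (1.0 + 0.3 * s) * frames + 0.01 * rng.standard_normal(frames.shape)
        params.append(parameterize_sequence(frames, centers))
        words.append(w)
        speakers.append(s)

X = np.vstack(params)                  # one point per sequence manifold
K = gaussian_kernel(X)
word_space = kernel_pls_scores(K, np.eye(3)[words], n_components=2)
speaker_space = kernel_pls_scores(K, np.eye(3)[speakers], n_components=2)
print(word_space.shape, speaker_space.shape)
```

Because every sequence is reduced to a fixed-length coefficient vector, sequences of different lengths become comparable points. Running KPLS once against word labels and once against speaker labels mirrors the two-way projection idea, giving one latent space per factor.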
