Manifold-Kernels Comparison in MKPLS for Visual Speech Recognition

Speech recognition is a challenging problem. Because of the limitations of the acoustic signal, visual information is essential for improving recognition accuracy in unconstrained real-life situations. One common approach is to model visual recognition as a nonlinear optimization problem. Solving this problem requires measuring the distances between visual units. Embedding the visual units on a manifold and using manifold kernels is one way to measure these distances. This work evaluates the performance of several manifold kernels for visual speech recognition. We present the theory behind each kernel. We apply the manifold kernel partial least squares (MKPLS) framework to the OuluVS and AVLetters databases and present an empirical comparison of all the kernels. This framework provides a convenient way to explore different kernels.
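As an illustration of what a manifold kernel computes, the following minimal sketch evaluates the projection kernel on a Grassmann manifold, k(X, Y) = ||X^T Y||_F^2, where X and Y are orthonormal bases of two subspaces. The abstract does not state which manifold or which kernels are compared, so the choice of the Grassmann manifold and of this kernel is an assumption, as are the function names and the toy data. This is not presented as the authors' implementation.

```python
import math


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`.

    Each vector stands in for one frame-level feature vector of a visual
    unit (a hypothetical representation chosen for this sketch).
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            # Remove the component of w along the basis vector b.
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors that are linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def projection_kernel(basis_x, basis_y):
    """Projection kernel k(X, Y) = ||X^T Y||_F^2 for orthonormal bases."""
    return sum(
        sum(xi * yi for xi, yi in zip(x, y)) ** 2
        for x in basis_x
        for y in basis_y
    )


# Toy data: three "visual units", each a short sequence of 3-D features.
seq_a = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # spans the xy-plane
seq_b = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]  # also spans the xy-plane
seq_c = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # spans the yz-plane

A, B, C = (orthonormal_basis(s) for s in (seq_a, seq_b, seq_c))
k_ab = projection_kernel(A, B)  # identical subspaces: maximal similarity
k_ac = projection_kernel(A, C)  # subspaces share only one direction
```

The kernel depends only on the subspaces, not on the particular frames that span them, so `seq_a` and `seq_b` are maximally similar even though their vectors differ. A kernel matrix built this way could then be passed to a kernel method such as kernel partial least squares.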
