Towards a practical lipreading system

A practical lipreading system can be either subject-dependent (SD) or subject-independent (SI). An SD system is user-specific, i.e., customized for one particular user, while an SI system has to cope with a large number of users. The two types of system pose different challenges and have to be treated differently. In this paper, we propose a simple deterministic model to tackle the problem. The model first seeks a low-dimensional manifold in which the visual features extracted from the frames of a video can be projected onto a continuous deterministic curve embedded in a path graph. It can also map arbitrary points on the curve back into the image space, which makes it suitable for temporal interpolation. Based on this model, we develop two separate strategies for SD and SI lipreading. The former is turned into a simple curve-matching problem, while for the latter we propose a video-normalization scheme to improve the system developed by Zhao et al. We evaluated our system on the OuluVS database and achieved recognition rates more than 20% higher than those reported by Zhao et al. in both SD and SI testing scenarios.
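The abstract does not spell out how the curve is constructed, so the following is only an illustrative sketch of why a path graph yields a continuous deterministic curve. It assumes an unweighted path graph with one vertex per video frame and its standard graph Laplacian; the function names `path_graph_laplacian` and `cosine_curve` are hypothetical and not taken from the paper. Under that assumption, the Laplacian eigenvectors are sampled cosines, so each embedding coordinate can also be evaluated at fractional frame positions.

```python
import numpy as np

def path_graph_laplacian(n):
    """Laplacian L = D - W of an unweighted path graph with n vertices."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0   # edge between consecutive frames i and i+1
    W[idx + 1, idx] = 1.0   # symmetric counterpart
    return np.diag(W.sum(axis=1)) - W

def cosine_curve(t, k, n):
    """Closed-form k-th eigenvector of the path-graph Laplacian,
    evaluated at frame position t in [1, n]; t may be fractional."""
    return np.cos(np.pi * k * (t - 0.5) / n)

n = 20                                   # assumed number of frames
L = path_graph_laplacian(n)
eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order

k = 3                                    # which embedding coordinate to inspect
frames = np.arange(1, n + 1)
analytic = cosine_curve(frames, k, n)
analytic = analytic / np.linalg.norm(analytic)

numeric = eigvecs[:, k]
if np.dot(analytic, numeric) < 0:        # eigenvectors are defined up to sign
    numeric = -numeric

# Largest deviation between the numerical eigenvector and the sampled cosine
max_err = np.max(np.abs(analytic - numeric))

# The same curve evaluated between frames 2 and 3 (temporal interpolation)
between_frames = cosine_curve(2.5, k, n)
```

Because the curve is known in closed form, the embedding of a video depends only on its length `n`, which is what makes evaluating it between frames straightforward in this sketch. Mapping such points back to image space, as the abstract describes, would need the learned projection and is not shown here.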

[1]  H. McGurk,et al.  Hearing lips and seeing voices , 1976, Nature.

[2]  Gunnar Rätsch,et al.  Large Scale Multiple Kernel Learning , 2006, J. Mach. Learn. Res.

[3]  Chalapathy Neti,et al.  Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.

[4]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[5]  Shuicheng Yan,et al.  Neighborhood preserving embedding , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[6]  Stephen Lin,et al.  Graph Embedding and Extensions: A General Framework for Dimensionality Reduction , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Yun Fu,et al.  Lipreading by Locality Discriminant Graph , 2007, 2007 IEEE International Conference on Image Processing.

[8]  Jenq-Neng Hwang,et al.  Lipreading from color video , 1997, IEEE Trans. Image Process.

[9]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[10]  Timothy F. Cootes,et al.  Extraction of Visual Features for Lipreading , 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[11]  Jean-Claude Junqua,et al.  A robust algorithm for word boundary detection in the presence of noise , 1994, IEEE Trans. Speech Audio Process.

[12]  Robert H. Halstead,et al.  Matrix Computations , 2011, Encyclopedia of Parallel Computing.

[13]  Matti Pietikäinen,et al.  Lipreading with Local Spatiotemporal Descriptors , 2009, IEEE Transactions on Multimedia.

[14]  Conrad Sanderson,et al.  The VidTIMIT Database , 2002 .

[15]  Ming Liu,et al.  AVICAR: audio-visual speech corpus in a car environment , 2004, INTERSPEECH.

[16]  James R. Glass,et al.  A segment-based audio-visual speech recognizer: data collection, development, and initial experiments , 2004, ICMI '04.

[17]  Jiri Matas,et al.  XM2VTSDB: The Extended M2VTS Database , 1999 .

[18]  Trevor Darrell,et al.  Visual speech recognition with loosely synchronized feature streams , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[19]  Matti Pietikäinen,et al.  Lipreading: A Graph Embedding Approach , 2010, 2010 20th International Conference on Pattern Recognition.