Lip-based visual speech recognition system

This paper proposes a system to address the problem of visual speech recognition. The proposed system is based on visual lip movement recognition by applying video content analysis technique. Using spatiotemporal features descriptors, we extracted features from video containing visual lip information. A preprocessing step is employed by removing the noise and enhancing the contrast of images in every frames of video. Extracted feature are used to build a dictionary for kernel sparse representation classifier (K-SRC) in the classification step. We adopted non-negative matrix factorization (NMF) method to reduce the dimensionality of the extracted features. We evaluated the performance of our system using AVLetters and AVLetters2 dataset. To evaluate the performance of our system, we used the same configuration as another previous works. Using AVLetters dataset, the promising accuracies of 67.13%, 45.37%, and 63.12% can be achieved in semi speaker dependent, speaker independent, and speaker dependent, respectively. Using AVLetters2 dataset, our method can achieve accuracy rate of 89.02% for speaker dependent case and 25.9% for speaker independent. This result showed that our proposed method outperforms another methods using same configuration.

[1]  Stephen J. Cox,et al.  The challenge of multispeaker lip-reading , 2008, AVSP.

[2]  Thomas S. Huang,et al.  A fast two-dimensional median filtering algorithm , 1979 .

[3]  Allen Y. Yang,et al.  Robust Face Recognition via Sparse Representation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Hua Yu,et al.  A direct LDA algorithm for high-dimensional data - with application to face recognition , 2001, Pattern Recognit..

[5]  Matti Pietikäinen,et al.  Human Activity Recognition Using a Dynamic Texture Based Method , 2008, BMVC.

[6]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[7]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[8]  Chalapathy Neti,et al.  Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.

[9]  P. Paatero Least squares formulation of robust non-negative factor analysis , 1997 .

[10]  Kenji Kita,et al.  Dimensionality reduction using non-negative matrix factorization for information retrieval , 2001, 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236).

[11]  Pierre Comon,et al.  Independent component analysis, A new concept? , 1994, Signal Process..

[12]  Ting Wang,et al.  Kernel Sparse Representation-Based Classifier , 2012, IEEE Transactions on Signal Processing.

[13]  Matti Pietikäinen,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON MULTIMEDIA 1 Lipreading with Local Spatiotemporal Descriptors , 2022 .

[14]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[15]  Timothy F. Cootes,et al.  Extraction of Visual Features for Lipreading , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Matti Pietikäinen,et al.  Local Binary Pattern Descriptors for Dynamic Texture Recognition , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[17]  Matti Pietikäinen,et al.  A review of recent advances in visual speech decoding , 2014, Image Vis. Comput..

[18]  Alioune Ngom,et al.  Sparse representation approaches for the classification of high-dimensional biological data , 2012, 2012 IEEE International Conference on Bioinformatics and Biomedicine.