Lip reading with Hahn Convolutional Neural Networks

Abstract: Lipreading, or visual speech recognition, is the process of decoding speech from a speaker's mouth movements. It is useful for people with hearing impairment, for understanding patients affected by laryngeal cancer or vocal cord paralysis, and in noisy environments. In this paper we aim to develop a speech recognition system that relies on video alone. Our main target application is in the medical field, assisting laryngectomized persons. To that end, we propose the Hahn Convolutional Neural Network (HCNN), a novel architecture that uses Hahn moments as the first layer of a Convolutional Neural Network (CNN). We show that HCNN reduces the dimensionality of the video images and shortens training time. The HCNN model is trained to classify letters, digits, or words presented as video images. We evaluated the proposed method on three datasets, AVLetters, OuluVS2, and BBC LRW, and we show that it achieves significant results compared with other works in the literature.
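To make the idea of a Hahn-moment first layer concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard Hahn weight w(x) = C(alpha + x, x) * C(beta + N - x, N - x) on x = 0..N, builds the weighted orthonormal polynomials numerically by QR factorization instead of the usual three-term recurrence, and treats the layer as a fixed, non-learned transform. The function names `hahn_basis` and `hahn_moments`, the parameters `alpha` and `beta`, and the 32 x 32 input size are illustrative assumptions; the abstract does not specify the parametrization or the moment order used.

```python
import math
import numpy as np

def hahn_basis(n_points, order, alpha=0.0, beta=0.0):
    """Return an (order, n_points) matrix whose rows are weighted
    orthonormal Hahn polynomials of degree 0..order-1 on x = 0..n_points-1.
    Assumes order <= n_points and alpha, beta > -1 (hypothetical defaults)."""
    N = n_points - 1
    x = np.arange(n_points)
    # Log of the Hahn weight C(alpha+x, x) * C(beta+N-x, N-x), via log-gamma.
    log_w = np.array([
        math.lgamma(alpha + k + 1) - math.lgamma(k + 1) - math.lgamma(alpha + 1)
        + math.lgamma(beta + N - k + 1) - math.lgamma(N - k + 1)
        - math.lgamma(beta + 1)
        for k in x
    ])
    sqrt_w = np.exp(0.5 * log_w)
    # Monomials in a variable rescaled to [-1, 1] for better conditioning;
    # adequate for the low orders used here, not for very high orders.
    t = 2.0 * x / max(N, 1) - 1.0
    V = np.vander(t, order, increasing=True) * sqrt_w[:, None]
    # QR orthonormalizes the columns: column n is sqrt(w) times a degree-n
    # polynomial, i.e. the weighted orthonormal Hahn polynomial up to sign.
    Q, _ = np.linalg.qr(V)
    return Q.T

def hahn_moments(image, order, alpha=0.0, beta=0.0):
    """Project a 2-D grayscale image onto the first `order` Hahn polynomials
    along each axis, giving an (order, order) moment matrix."""
    By = hahn_basis(image.shape[0], order, alpha, beta)
    Bx = hahn_basis(image.shape[1], order, alpha, beta)
    return By @ image @ Bx.T

# A 32 x 32 mouth-region frame (random stand-in) reduced to 8 x 8 moments.
frame = np.random.default_rng(0).random((32, 32))
moments = hahn_moments(frame, order=8, alpha=1.0, beta=1.0)
print(moments.shape)
```

In this sketch the moment matrix, rather than the raw frame, would be passed to the convolutional layers, which is one way the dimensionality reduction described in the abstract could arise. When the order equals the image side length the basis is square and orthogonal, so the transform is exactly invertible.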
