Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features

Extraction of relevant lip features is of continuing interest in the visual speech domain. End-to-end feature extraction can produce good results, but at the cost of features that are difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric, glimpse-based psychological research into facial "barcodes", and demonstrate that these simple, easily extracted 3D geometric features (produced using Gabor-based image patches) can be used for speech recognition with LSTM-based machine learning. The approach extracts low-dimensionality lip parameters with a minimum of processing. A key difference between these Gabor-based features and alternatives such as traditional DCT or currently fashionable CNN features is that they are human-centric: they can be visualised and analysed directly, making the results easier to explain. They also support reliable speech recognition, as demonstrated on the Grid corpus: for overlapping speakers, our lightweight system achieved a recognition rate of over 82%, which compares well with less explainable features in the literature.
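To make the Gabor-based idea concrete, the following is a minimal sketch of how a real-valued Gabor filter can be built and applied to a lip-region image patch. This is an illustration of the general technique only: the kernel parameters, function names, and patch handling here are assumptions for demonstration, not the paper's actual configuration.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=8.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel: a Gaussian envelope times a cosine carrier.
    Parameter values are illustrative assumptions, not the paper's settings."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates by the filter orientation theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / lambd + psi)
    return envelope * carrier

def gabor_response(patch, kernel):
    """Scalar filter response of one image patch (same size as the kernel)."""
    return float(np.sum(patch * kernel))
```

A bank of such kernels at a few orientations, applied to patches around the mouth, yields a small set of responses that can be inspected visually (the "barcode"-like stripe structure) before being passed to a downstream classifier.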
