Concatenated Frame Image Based CNN for Visual Speech Recognition

This paper proposes a novel representation of image sequences called the concatenated frame image (CFI), two data augmentation methods for CFI, and a CFI-based convolutional neural network (CNN) framework for the visual speech recognition (VSR) task. CFI is simple, yet it contains the spatio-temporal information of a whole image sequence. The proposed method was evaluated on the public OuluVS2 database, a multi-view audio-visual dataset recorded from 52 subjects. Speaker-independent recognition tasks were carried out under various experimental conditions. As a result, the proposed method obtained high recognition accuracy.
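The abstract does not say how a CFI is built, so the following is only an illustrative sketch of the general idea of concatenating frames into a single image. It assumes grayscale mouth-region frames of equal size, uniform temporal sampling, and a row-by-row grid layout; the function name `make_cfi` and the `rows`/`cols` parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_cfi(frames, rows=4, cols=4):
    """Tile rows*cols uniformly sampled frames into one 2-D image.

    Hypothetical helper: the sampling scheme and grid layout are
    assumptions, not the construction described in the paper.
    """
    frames = np.asarray(frames)
    if frames.ndim != 3:
        raise ValueError("expected grayscale frames with shape (T, H, W)")
    n = rows * cols
    # Uniformly sample n frame indices over the sequence
    # (frames are repeated if the sequence has fewer than n frames).
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    sampled = frames[idx]                      # shape (n, H, W)
    h, w = sampled.shape[1:]
    # Place the sampled frames row by row, in temporal order.
    grid = sampled.reshape(rows, cols, h, w).transpose(0, 2, 1, 3)
    return grid.reshape(rows * h, cols * w)

# Usage: a 30-frame sequence of 20x32 frames becomes one 80x128 image.
sequence = np.random.rand(30, 20, 32)
cfi = make_cfi(sequence)
print(cfi.shape)
```

Under these assumptions, an image of this kind can be passed to an ordinary 2-D CNN, which is one way a single image could carry both the appearance and the temporal order of a whole utterance.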
