Paper information - Concatenated Frame Image Based CNN for Visual Speech Recognition

This paper proposes a novel image-sequence representation called the concatenated frame image (CFI), two data augmentation methods for CFI, and a CFI-based convolutional neural network (CNN) framework for the visual speech recognition (VSR) task. CFI is simple, yet it contains the spatio-temporal information of a whole image sequence. The proposed method was evaluated on the public OuluVS2 database, a multi-view audio-visual dataset recorded from 52 subjects. Speaker-independent recognition tasks were carried out under various experimental conditions. As a result, the proposed method obtained high recognition accuracy.
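
To make the idea of a concatenated frame image concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that a fixed number of frames is sampled evenly from the sequence and tiled into a grid in row-major order; the abstract does not specify the sampling scheme, grid size, or frame layout, so the names `sample_indices` and `concatenated_frame_image` and the parameters `grid_rows` and `grid_cols` are hypothetical.

```python
def sample_indices(num_frames, num_samples):
    """Pick num_samples indices spread evenly over range(num_frames).

    Assumption: even sampling; the abstract does not state how frames are chosen.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def concatenated_frame_image(frames, grid_rows, grid_cols):
    """Tile evenly sampled frames into one grid image, in row-major order.

    frames: list of 2-D grayscale frames (each a list of pixel rows),
            all with the same height and width.
    Returns a single 2-D image as a list of pixel rows.
    """
    indices = sample_indices(len(frames), grid_rows * grid_cols)
    picked = [frames[i] for i in indices]
    frame_height = len(picked[0])

    image = []
    for r in range(grid_rows):
        # Frames that belong to this row of the grid.
        row_frames = picked[r * grid_cols:(r + 1) * grid_cols]
        for y in range(frame_height):
            # Join pixel row y of each frame in this grid row, left to right.
            image.append([pixel for frame in row_frames for pixel in frame[y]])
    return image


if __name__ == "__main__":
    # Six toy 2x2 frames; frame k is filled with the value k.
    frames = [[[k, k], [k, k]] for k in range(6)]
    cfi = concatenated_frame_image(frames, grid_rows=2, grid_cols=2)
    for row in cfi:
        print(row)
```

Because the whole sequence becomes one two-dimensional image, it can be passed to an ordinary image-classification CNN, which is presumably what lets a 2-D network see both spatial and temporal structure at once.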
