Lip Reading Using Convolutional Neural Networks with and without Pre-Trained Models

Lip reading has recently become a popular research topic, and there is an extensive literature on it within human action recognition. Deep learning methods are frequently used in this area. In this paper, lip reading from video data is performed using self-designed convolutional neural networks (CNNs). For this purpose, both the standard and an augmented version of the AvLetters dataset are used in the training and test stages. To optimize network performance, the mini-batch size parameter is tuned and its effect is investigated. Additionally, experiments are performed using the pre-trained AlexNet and GoogleNet CNNs. Detailed experimental results are presented.
