Three-Dimensional Joint Geometric-Physiologic Feature for Lip-Reading

Lip-reading has been shown to improve the performance of automatic speech recognition systems, especially in the presence of acoustic noise. However, lip features obtained from discrete three-dimensional points and planar images still convey insufficient information about lip movement, because they neither describe nor reflect its internal mechanisms. In this paper, we employed densely connected convolutional networks (DenseNets) to obtain visual representations from color images. In addition, a new 3D lip physiologic feature, based on the position and structure of the facial muscles, was extracted to represent the similarity in the way people speak. The color-image feature and the 3D lip geometric-physiologic feature were coupled in the last fully connected layer of the DenseNet. The experimental results show that DenseNets can handle the spatio-temporal information of a whole image sequence, and that integrating our proposed 3D geometric-physiologic feature improves the recognition rate by as much as 3.91% (from 94.84% with color images only to 98.75%).
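The fusion step described above, concatenating the 3D geometric-physiologic feature with the DenseNet image feature at the last fully connected layer, can be illustrated in code. The following is a minimal PyTorch sketch, not the authors' implementation: the DenseNet-121 backbone, the feature dimension `physio_dim`, and the class count `num_classes` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LateFusionLipReader(nn.Module):
    """Sketch of late fusion: DenseNet color-image features concatenated
    with a 3D geometric-physiologic feature vector before the final
    fully connected (classification) layer. Dimensions are assumptions."""

    def __init__(self, physio_dim=30, num_classes=10):
        super().__init__()
        densenet = models.densenet121(weights=None)
        self.backbone = densenet.features            # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        img_dim = densenet.classifier.in_features    # 1024 for DenseNet-121
        # The last fully connected layer sees both modalities concatenated.
        self.classifier = nn.Linear(img_dim + physio_dim, num_classes)

    def forward(self, images, physio):
        # images: (batch, 3, H, W) color lip frames
        # physio: (batch, physio_dim) 3D geometric-physiologic features
        x = torch.relu(self.backbone(images))
        x = self.pool(x).flatten(1)
        fused = torch.cat([x, physio], dim=1)        # late fusion
        return self.classifier(fused)

# Usage with random tensors, just to show the expected shapes.
model = LateFusionLipReader()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 30))
```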
