Visual Recognition of Continuous Cued Speech Using a Tandem CNN-HMM Approach

This study addresses the problem of automatic recognition of Cued Speech (CS), a visual mode of communication for hearing-impaired people in which a complete phonetic repertoire is obtained by combining lip movements with hand cues. In the proposed system, the dynamics of visual features extracted from lip and hand images using convolutional neural networks (CNNs) are modeled by a set of hidden Markov models (HMMs), one for each phonetic context (tandem architecture). CNN-based feature extraction is compared to an unsupervised approach based on principal component analysis (PCA). A novel temporal segmentation of the hand stream is used to train the CNNs efficiently. Different strategies for combining the extracted visual features within the HMM decoder are investigated. Experimental evaluation is carried out on an audiovisual dataset (containing only continuous French sentences) recorded specifically for this study. In its best configuration, and without exploiting any dictionary or language model, the proposed tandem CNN-HMM architecture correctly identifies more than 73% of the phonemes (62% when considering insertion errors).
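To make the tandem idea concrete, the sketch below chains small CNN feature extractors (one per visual stream) with a Gaussian HMM. It is a minimal illustration, assuming PyTorch for the CNNs and hmmlearn's GaussianHMM as a stand-in for the set of context-dependent phoneme HMMs; the network topology, 32x32 image size, and feature dimension are illustrative choices, not the configuration used in the study.

```python
# Minimal sketch of a tandem CNN-HMM pipeline (illustrative assumptions
# throughout; this is not the paper's exact architecture or training setup).

import torch
import torch.nn as nn
from hmmlearn.hmm import GaussianHMM  # stand-in for the paper's HMM decoder


class VisualFeatureCNN(nn.Module):
    """Hypothetical CNN mapping a 32x32 grayscale ROI to a feature vector."""

    def __init__(self, feat_dim: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # 16x16 -> 8x8
        )
        self.fc = nn.Linear(32 * 8 * 8, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x).flatten(1))         # (T, feat_dim)


# One CNN per visual stream (lips, hand).
lip_cnn, hand_cnn = VisualFeatureCNN(), VisualFeatureCNN()

# Stand-in video: T frames of lip and hand regions of interest.
T = 120
lips = torch.randn(T, 1, 32, 32)
hand = torch.randn(T, 1, 32, 32)

with torch.no_grad():
    # Feature-level fusion: concatenate the two streams frame by frame.
    feats = torch.cat([lip_cnn(lips), hand_cnn(hand)], dim=1).numpy()

# A single Gaussian HMM stands in for the set of context-dependent phoneme
# HMMs; real decoding would run Viterbi over a network of phoneme models.
hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=10)
hmm.fit(feats)
print(hmm.predict(feats)[:20])  # frame-level hidden-state sequence
```

Concatenating per-frame lip and hand features before HMM decoding corresponds to one of the feature-combination strategies the abstract mentions; decision-level fusion of separately decoded lip and hand streams would be the main alternative.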
