Automatic Detection of the Temporal Segmentation of Hand Movements in British English Cued Speech

Cued Speech (CS) is a multi-modal system that complements lip reading with manual hand cues at the phonetic level to make spoken language visible. Lip and hand movements have been found to be asynchronous in CS, so studying the temporal organization of the hand is important for multi-modal CS feature fusion. In this work, we propose a novel diphthong-hand preceding model (D-HPM) by investigating the relationship between hand preceding time (HPT) and diphthong time instants in sentences of British English CS. We also show that the HPTs of the first and second parts of a diphthong are very strongly correlated. Combining the monophthong-HPM (M-HPM) and the D-HPM, we present a hybrid temporal segmentation detection algorithm (HTSDA) for hand movements in CS. The proposed algorithm is evaluated in a hand position recognition experiment using both a multi-Gaussian classifier and a long short-term memory (LSTM) network. The results show that HTSDA significantly improves recognition performance compared with the baseline (i.e., audio-based segmentation) and the state-of-the-art M-HPM. To the best of our knowledge, this is the first work to study the temporal organization of hand movements in British English CS.
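The central idea of a hand preceding model can be illustrated with a minimal sketch: audio-derived phonetic segment boundaries are shifted earlier in time by a hand preceding time to approximate the timing of the corresponding hand movement. The function name, segment representation, and the constant HPT value below are illustrative assumptions, not the paper's actual model or parameters.

```python
# Hypothetical sketch of HPT-based segmentation: each audio-based segment
# boundary (in seconds) is moved earlier by the hand preceding time, since
# in CS the hand reaches its target position before the acoustic event.

def shift_segments(audio_segments, hpt):
    """Shift each (start, end) boundary earlier by hpt seconds,
    clamping at zero, to approximate hand-movement timing."""
    return [(max(0.0, start - hpt), max(0.0, end - hpt))
            for start, end in audio_segments]

# Audio-based vowel segments and an assumed (illustrative) mean HPT of 0.15 s.
audio_segments = [(0.30, 0.45), (0.60, 0.80)]
hand_segments = shift_segments(audio_segments, 0.15)
print(hand_segments)  # segments shifted ~0.15 s earlier than the audio
```

A full model would make the shift depend on context (e.g., separate preceding times for monophthongs and for the two parts of a diphthong, as the M-HPM and D-HPM distinction suggests), rather than using a single constant offset.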
