Visuo-phonetic decoding using multi-stream and context-dependent models for an ultrasound-based silent speech interface

Recent improvements are presented in the phonetic decoding of continuous speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMMs). Results are compared to a baseline system that uses context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated successively with and without linguistic constraints, which are introduced via a 2.5k-word decoding dictionary.
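Since the abstract centers on multi-stream HMMs, it may help to recall the standard multi-stream emission likelihood used in audio-visual speech recognition, which is presumably the formulation intended here; the notation below is the common convention, not taken from the paper itself. For a state j and an observation vector o_t split into S streams (here, the tongue and lip feature streams),

  b_j(\mathbf{o}_t) = \prod_{s=1}^{S} \Big[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}\big(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}\big) \Big]^{\gamma_s},

where each stream s is modeled by an M_s-component Gaussian mixture with mixture weights c_{jsm}, and the stream exponents \gamma_s weight the relative reliability of the streams. Context dependency then amounts to defining these states over phones in context (e.g., triphones) rather than over context-independent phones.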

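The PCA-based coding of the tongue and lip images amounts to projecting each video frame onto a basis of principal components learned from training images (the "EigenTongues" idea cited above). The following is a minimal, hypothetical sketch of such a coder in Python with scikit-learn; the image size, the number of retained components, and all variable names are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.decomposition import PCA

# Illustrative assumption: each ultrasound frame is a 64x64 grayscale
# image, flattened to a 4096-dimensional row vector of X.
n_frames, h, w = 1000, 64, 64
X = np.random.rand(n_frames, h * w)  # stand-in for real tongue images

# Learn the principal-component basis ("eigentongues") from training
# frames; 30 retained components is an arbitrary choice for this sketch.
pca = PCA(n_components=30)
pca.fit(X)

# Each frame is then coded by its projection coefficients onto that
# basis, giving one compact visual feature vector per frame.
features = pca.transform(X)  # shape: (n_frames, 30)
print(features.shape)

In the system described above, feature vectors of this kind, computed separately for the tongue and lip streams, would form the per-stream observations of the multi-stream HMMs.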