Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips

This article presents a segmental vocoder driven by ultrasound and optical images (standard CCD camera) of the tongue and lips for a "silent speech interface" application, usable either by a laryngectomized patient or for silent communication. The system is built around an audio-visual dictionary that associates visual with acoustic observations for each phonetic class. Visual features are extracted from ultrasound images of the tongue and from video images of the lips using a PCA-based image coding technique. The visual observations of each phonetic class are modeled by continuous HMMs. The system then combines a phone-recognition stage with corpus-based synthesis. In the recognition stage, the visual HMMs are used to identify phonetic targets in a sequence of visual features. In the synthesis stage, these phonetic targets constrain the dictionary search for the sequence of diphones that maximizes similarity to the input test data in the visual space, subject to a concatenation cost in the acoustic domain. A prosody template is extracted from the training corpus, and the final speech waveform is generated using "Harmonic plus Noise Model" (HNM) concatenative synthesis. Experimental results are reported on an audiovisual database containing one hour of continuous speech from each of two speakers.
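
The synthesis stage described above is, at its core, a unit-selection search: given the phonetic targets from the recognition stage, choose the dictionary diphone sequence that minimizes a visual target cost plus an acoustic concatenation cost (maximizing visual similarity is equivalent to minimizing a visual target cost). Below is a minimal Viterbi-style sketch of such a search; the function names, cost callbacks, and data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def select_units(target_feats, candidates, visual_cost, concat_cost):
    """Viterbi-style unit-selection sketch.

    target_feats : one visual feature vector per target diphone slot
                   (slots come from the phone-recognition stage).
    candidates   : per-slot lists of candidate diphone units from the
                   audio-visual dictionary, restricted to the recognized
                   phonetic class.
    visual_cost  : fn(unit, feat) -> target cost in the visual space.
    concat_cost  : fn(prev_unit, unit) -> join cost in the acoustic domain.
    Returns the minimum-cost sequence of units.
    """
    T = len(target_feats)
    # best[t][j] = (cumulative cost, backpointer) for ending slot t
    # with candidate j.
    best = [[(visual_cost(u, target_feats[0]), -1) for u in candidates[0]]]
    for t in range(1, T):
        row = []
        for u in candidates[t]:
            tc = visual_cost(u, target_feats[t])
            # Cheapest way to reach u from any candidate of slot t-1.
            c, k = min(
                (best[t - 1][k][0] + concat_cost(v, u), k)
                for k, v in enumerate(candidates[t - 1])
            )
            row.append((c + tc, k))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(candidates[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(T - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return path[::-1]
```

With T target slots and at most K candidates per slot, this dynamic-programming search runs in O(T·K²), which is why constraining the candidates to the recognized phonetic class keeps the dictionary search tractable.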
