Towards a segmental vocoder driven by ultrasound and optical images of the tongue and lips

This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips, intended for a “silent speech interface” application. The system is built around an HMM-based visual phone recognition step that provides target phonetic sequences from a continuous stream of visual observations. The phonetic target constrains the search for the optimal sequence of diphones: the one that maximizes similarity to the input test data in visual space, subject to a unit concatenation cost in the acoustic domain. The final speech waveform is generated using “Harmonic plus Noise Model” synthesis techniques. Experimental results are based on a one-hour continuous-speech audiovisual database comprising ultrasound images of the tongue together with frontal and lateral views of the speaker’s lips.
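To make the unit-selection step concrete, the following is a minimal sketch of the kind of Viterbi-style dynamic program the abstract describes: for each diphone in the recognized phonetic target sequence, one candidate unit is chosen from the database so that the sum of a target cost (dissimilarity to the input in visual space) and a concatenation cost (join mismatch in the acoustic domain) is minimized. All function names, data layouts, and the Euclidean cost functions are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

# ASSUMPTION: Euclidean distance stands in for the paper's visual similarity measure.
def visual_target_cost(unit_feats: np.ndarray, input_feats: np.ndarray) -> float:
    """Dissimilarity between a candidate unit and the input segment in
    visual (tongue/lip) feature space."""
    return float(np.linalg.norm(unit_feats - input_feats))

# ASSUMPTION: Euclidean distance stands in for the paper's acoustic join cost.
def acoustic_join_cost(prev_unit_end: np.ndarray, unit_start: np.ndarray) -> float:
    """Spectral mismatch at the join between consecutive units in the
    acoustic domain."""
    return float(np.linalg.norm(prev_unit_end - unit_start))

def select_units(candidates, input_segments):
    """candidates[t]: list of (visual_feats, acoustic_start, acoustic_end)
    tuples for the t-th target diphone; input_segments[t]: the matching
    visual input features. Returns the chosen candidate index per position."""
    T = len(candidates)
    # cost[t][j]: best cumulative cost ending with candidate j at position t
    cost = [[0.0] * len(candidates[t]) for t in range(T)]
    back = [[0] * len(candidates[t]) for t in range(T)]
    for j, (vis, _, _) in enumerate(candidates[0]):
        cost[0][j] = visual_target_cost(vis, input_segments[0])
    for t in range(1, T):
        for j, (vis, ac_start, _) in enumerate(candidates[t]):
            # Best predecessor: minimize previous cost plus join cost.
            best_k = min(
                range(len(candidates[t - 1])),
                key=lambda k: cost[t - 1][k]
                + acoustic_join_cost(candidates[t - 1][k][2], ac_start),
            )
            back[t][j] = best_k
            cost[t][j] = (
                cost[t - 1][best_k]
                + acoustic_join_cost(candidates[t - 1][best_k][2], ac_start)
                + visual_target_cost(vis, input_segments[t])
            )
    # Backtrack from the cheapest final candidate.
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The selected units would then be concatenated and resynthesized, in the paper’s pipeline, with Harmonic plus Noise Model techniques; that synthesis stage is not sketched here.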
