Paper information - Towards a segmental vocoder driven by ultrasound and optical images of the tongue and lips


This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips for a “silent speech interface” application. The system is built around an HMM-based visual phone recognition step which provides target phonetic sequences from a continuous visual observation stream. The phonetic target constrains the search for the optimal sequence of diphones that maximizes similarity to the input test data in visual space, subject to a unit concatenation cost in the acoustic domain. The final speech waveform is generated using “Harmonic plus Noise Model” synthesis techniques. Experimental results are based on a one-hour continuous speech audiovisual database comprising ultrasound images of the tongue and both frontal and lateral views of the speaker’s lips.
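The diphone search described above can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the `Unit` class, the `select_units` function, the use of Euclidean distance for both costs, and the `w_concat` weight are all assumptions made for illustration. It only shows the general shape of a search that trades a target cost in visual space against a concatenation cost in the acoustic domain.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Unit:
    """Hypothetical candidate diphone unit taken from a training database."""
    name: str
    visual: tuple    # visual feature vector summarizing the unit
    ac_start: tuple  # acoustic features at the unit's left boundary
    ac_end: tuple    # acoustic features at the unit's right boundary


def select_units(targets, candidates, w_concat=1.0):
    """Pick one unit per diphone slot with a Viterbi-style search.

    targets[i]    : visual features observed for the i-th diphone slot.
    candidates[i] : list of Units whose label matches the i-th diphone.
    Returns (total_cost, list of chosen Units).
    Both costs are Euclidean distances here, which is an assumption.
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")
    if any(not c for c in candidates):
        raise ValueError("every diphone slot needs at least one candidate")

    # costs[j]: cheapest cumulative cost of a path ending in candidate j.
    costs = [dist(u.visual, targets[0]) for u in candidates[0]]
    back = []  # back[i-1][j]: best predecessor index for candidate j of slot i

    for i in range(1, len(targets)):
        new_costs, pointers = [], []
        for u in candidates[i]:
            # Target cost: mismatch with the observed visual data.
            target_cost = dist(u.visual, targets[i])
            # Concatenation cost: acoustic mismatch at the join.
            k, joined = min(
                ((k, costs[k] + w_concat * dist(p.ac_end, u.ac_start))
                 for k, p in enumerate(candidates[i - 1])),
                key=lambda kc: kc[1],
            )
            new_costs.append(joined + target_cost)
            pointers.append(k)
        costs = new_costs
        back.append(pointers)

    # Trace the cheapest path back from the last slot to the first.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]
```

In this sketch, raising `w_concat` favors units that join smoothly in the acoustic domain, while setting it to zero reduces the search to picking the visually closest unit in each slot independently.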

Gérard Chollet | Bruce Denby | Thomas Hueber | Gérard Dreyfus | Maureen Stone
