Synthesizing speech from Doppler signals

It has long been a desirable goal to construct an intelligible speech signal merely by observing the talker in the act of speaking. Past approaches have relied on camera-based observation of the talker's face, combined with statistical methods that infer the speech signal from the captured facial motion. Other methods synthesize speech from measurements taken by electromyographs and similar devices that must be tethered to the talker, an undesirable setup. In this paper we present a new device for synthesizing speech from characterizations of the facial motion associated with speech: a Doppler sonar. Facial movement is characterized through the Doppler frequency shifts induced in a tone incident on the talker's face, and these shifts are used to infer the underlying speech signal. The setup is far-field and untethered, with the sonar operating at the distance of a regular desktop microphone. Preliminary experimental evaluations are very promising: we are able to synthesize reasonable speech signals, comparable to those obtained from tethered devices such as EMGs.
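The principle the abstract describes, characterizing facial movement through Doppler frequency shifts of a reflected tone, can be sketched with the standard two-way Doppler relation. This is only an illustration of the underlying physics, not the paper's processing pipeline; the 40 kHz carrier frequency and the facial-surface velocity used below are assumed example values, not figures from the paper.

```python
# Sketch of the two-way Doppler shift underlying an ultrasonic sonar:
# a tone reflected off a surface moving with velocity v toward the
# transceiver returns shifted by approximately df = 2 * v * f0 / c
# (valid for v much smaller than the speed of sound c).

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def doppler_shift(carrier_hz: float, velocity_m_s: float) -> float:
    """Frequency shift of a tone reflected off a surface moving at
    velocity_m_s toward (+) or away from (-) the transceiver."""
    return 2.0 * velocity_m_s * carrier_hz / SPEED_OF_SOUND

# Hypothetical example: a 40 kHz carrier and lips moving at 0.1 m/s
# toward the sensor yield a shift of roughly 23 Hz.
shift = doppler_shift(40_000.0, 0.1)
```

Because articulator velocities during speech are small relative to the speed of sound, the shifts are only tens of hertz around the carrier, which is why the received signal must be demodulated and analyzed narrowly around the transmitted tone before any statistical mapping to speech can be attempted.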
