DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface

In this paper we present our initial results on articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Although deep learning has revolutionized several fields, so far only a few researchers have applied DNNs to this task. Here, we compare various feature representation approaches combined with DNN-based regression. As input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. Since the raw ultrasound images have a relatively high resolution, we experimented with several feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals produced by the various DNN configurations were evaluated using both objective measures and a subjective listening test. We found that the representation that combined several neighboring image frames with a feature selection method was preferred both by the subjects taking part in the listening experiments and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.
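As a rough illustration of the pipeline described above, the following minimal Python sketch shows per-frame dimensionality reduction, stacking of neighboring ultrasound frames, DNN regression to MGC-LSP targets, and NMSE evaluation. It is not the authors' implementation: PCA merely stands in for the feature selection and transformation methods compared in the paper, and all array shapes, layer sizes, and variable names are illustrative assumptions.

# Sketch (assumed names and shapes), not the authors' code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def stack_neighbours(features, context=2):
    """Concatenate each frame with its +/- `context` neighboring frames."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)] for i in range(2 * context + 1)])

def nmse(y_true, y_pred):
    """Normalized Mean Squared Error used as the objective measure."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Hypothetical data: flattened ultrasound frames and target MGC-LSP vectors.
rng = np.random.default_rng(0)
ultrasound = rng.random((1000, 64 * 64))   # 1000 frames, 64x64 pixels each (assumed size)
mgc_lsp = rng.random((1000, 25))           # 24 MGC-LSP coefficients + gain per frame (assumed order)

# 1) Reduce the high-resolution images to a compact feature vector
#    (PCA here, standing in for the feature selection/transformation methods compared).
pca = PCA(n_components=128).fit(ultrasound[:800])
feats = pca.transform(ultrasound)

# 2) Add temporal context by stacking several neighboring image frames.
feats = stack_neighbours(feats, context=2)

# 3) Standardize and train a feed-forward DNN to estimate the MGC-LSP targets.
scaler = StandardScaler().fit(feats[:800])
dnn = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=50)
dnn.fit(scaler.transform(feats[:800]), mgc_lsp[:800])

# 4) Evaluate on held-out frames; the predicted coefficients would then drive
#    a pulse-noise vocoder to synthesize the speech waveform.
pred = dnn.predict(scaler.transform(feats[800:]))
print("NMSE:", nmse(mgc_lsp[800:], pred))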
