Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder

It was recently shown in the Silent Speech Interface (SSI) field that F0 can be predicted from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers have been shown to produce higher-quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that in the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
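To make the mapping concrete, the following is a minimal sketch (not the authors' exact architecture) of a convolutional regression network that maps a single ultrasound tongue image to one frame of continuous vocoder parameters (ContF0, Maximum Voiced Frequency, and MGC). The input image size, MGC order, and all layer sizes are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: CNN regression from a UTI frame to a continuous vocoder
# parameter frame. All shapes and hyperparameters below are assumptions.
import numpy as np
from tensorflow.keras import layers, models

N_MGC = 25             # assumed MGC order (24) plus energy term
N_TARGETS = N_MGC + 2  # MGC + ContF0 + Maximum Voiced Frequency

def build_uti_to_vocoder_cnn(img_height=64, img_width=128):
    """Convolutional regression network: UTI frame -> vocoder parameter frame."""
    model = models.Sequential([
        layers.Input(shape=(img_height, img_width, 1)),
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(N_TARGETS, activation='linear'),  # regression outputs
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Usage example with random arrays standing in for preprocessed UTI frames
# and per-frame vocoder targets (both assumed to be z-score normalised).
model = build_uti_to_vocoder_cnn()
x = np.random.rand(100, 64, 128, 1).astype('float32')
y = np.random.rand(100, N_TARGETS).astype('float32')
model.fit(x, y, epochs=1, batch_size=16)
```

In practice the predicted ContF0, Maximum Voiced Frequency, and MGC trajectories would then be passed to the continuous vocoder for waveform synthesis; since ContF0 is defined for every frame, no separate voiced/unvoiced decision is needed at synthesis time.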
