Real-time control of a DNN-based articulatory synthesizer for silent speech conversion: a pilot study

This article presents a pilot study on the real-time control of an articulatory synthesizer based on a deep neural network (DNN), in the context of silent speech interfaces. The underlying hypothesis is that a silent speaker could benefit from real-time audio feedback to regulate their own production. In this study, we use 3D electromagnetic articulography (EMA) to capture speech articulation, a DNN to convert EMA data into spectral trajectories in real time, and a standard vocoder excited by white noise for audio synthesis. As shown by recent literature on silent speech, the articulatory-to-acoustic modeling process must be adapted to account for possible inconsistencies between the initial training phase and practical usage conditions. Here, we focus on differences in sensor setup across sessions (for the same speaker). Model adaptation is performed by cascading an additional neural network with the DNN used for the articulatory-to-acoustic mapping. The intelligibility of the synthetic speech signal converted in real time is evaluated using both objective and perceptual measures.
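For illustration, below is a minimal sketch of the two-stage scheme described above: a base DNN maps EMA frames to spectral frames, and a small adaptation network is cascaded with the frozen base model and trained on a short parallel recording from the new session. PyTorch, all feature dimensions and layer sizes, and the placement of the adapter in front of the base network are assumptions made for the sketch, not the configuration reported in the article.

```python
# Minimal sketch (assumed framework: PyTorch) of DNN-based
# articulatory-to-acoustic mapping with session adaptation via a
# cascaded network. Dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

EMA_DIM = 18   # assumed: 3D coordinates of six EMA sensors
MGC_DIM = 25   # assumed: mel-cepstral coefficients driving the vocoder

class ArticulatoryToAcousticDNN(nn.Module):
    """Base feedforward network: one EMA frame -> one spectral frame."""
    def __init__(self, in_dim=EMA_DIM, hidden=256, out_dim=MGC_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class SessionAdapter(nn.Module):
    """Small network cascaded in front of the frozen base model,
    mapping the new session's EMA space toward the training session's."""
    def __init__(self, dim=EMA_DIM, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

def adapt(base, adapter, ema_new, mgc_target, epochs=50, lr=1e-3):
    """Train only the adapter on a small parallel adaptation set
    (EMA frames and target spectral frames from the new session)."""
    for p in base.parameters():
        p.requires_grad_(False)          # base mapping stays frozen
    opt = torch.optim.Adam(adapter.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        pred = base(adapter(ema_new))    # cascade: adapter -> base DNN
        loss = loss_fn(pred, mgc_target)
        loss.backward()
        opt.step()
    return adapter
```

In this sketch, `ema_new` and `mgc_target` stand for a short adaptation recording made at the start of a new session; freezing the base model keeps adaptation cheap enough to fit a real-time workflow, since only the small adapter is re-estimated per session.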
