Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

Several approaches exist for recording articulatory movements, such as electromagnetic articulography, permanent magnet articulography, ultrasound tongue imaging and surface electromyography. Although magnetic resonance imaging (MRI) is more costly than these approaches, recent developments now allow the recording of real-time MRI videos of the articulators at an acceptable resolution. Here, we experiment with reconstructing the speech signal from real-time MRI recordings using deep neural networks. Instead of estimating speech directly, our networks are trained to output a spectral vector, from which we reconstruct the speech signal using the WaveGlow neural vocoder. We compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrent (LSTM) layers. Besides the mean absolute error (MAE) of our networks, we also evaluate the reconstructed speech signals using several objective speech quality metrics: mel-cepstral distortion (MCD), Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). The results indicate that our approach can successfully reconstruct the gross spectral shape, but further improvements are needed to reproduce the fine spectral details.
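
As a rough illustration of the pipeline described above, the following PyTorch sketch maps a sequence of MRI frames to per-frame mel-spectral vectors with a 2D-CNN frame encoder followed by an LSTM, trained with the MAE (L1) loss; the predicted spectrogram would then be handed to a pretrained WaveGlow vocoder for waveform synthesis. All layer sizes, the 80-dimensional mel target and the 68x68 frame size are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch (not the authors' exact architecture): a 2D-CNN frame encoder
# followed by an LSTM predicts one 80-dim mel-spectral vector per MRI frame.
# Layer sizes and the 68x68 frame resolution are illustrative assumptions.
import torch
import torch.nn as nn

class MRI2Mel(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        # Per-frame convolutional encoder for (1, 68, 68) midsagittal MRI images.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),      # -> (64, 4, 4)
            nn.Flatten(),                 # -> 1024-dim frame feature
        )
        # Recurrent layer models the temporal dynamics across MRI frames.
        self.lstm = nn.LSTM(input_size=1024, hidden_size=256,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(256, n_mels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, H, W) -> mel vectors: (batch, time, n_mels)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out)

# One illustrative training step with the MAE (L1) loss mentioned in the abstract,
# on a dummy batch of 2 clips with 50 frames each and aligned mel targets.
model = MRI2Mel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mri = torch.randn(2, 50, 1, 68, 68)
target_mel = torch.randn(2, 50, 80)
optimizer.zero_grad()
loss = nn.functional.l1_loss(model(mri), target_mel)
loss.backward()
optimizer.step()
# At inference time, the predicted mel-spectrogram would be passed to a
# pretrained WaveGlow model (e.g. NVIDIA's public implementation) to generate
# the waveform; that vocoder call is omitted here.
```

The three compared architectures presumably differ in how the convolutional and recurrent parts are combined (e.g. frame-wise 2D CNN with or without an LSTM, or a 3D CNN over stacked frames); in this sketch, only the encoder portion would need to change.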
