Speaker-dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract

Articulatory-to-acoustic (forward) mapping is a technique for predicting speech from articulatory data acquired with various modalities (e.g. ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not previously been used for this purpose. The advantage of MRI is its high 'relative' spatial resolution: it captures not only lingual, labial and jaw motion, but also the velum and the pharyngeal region, which is typically not possible with other techniques. In the current paper, we train various DNNs (fully connected, convolutional and recurrent neural networks) for articulatory-to-speech conversion, using rtMRI as input, in a speaker-specific way. We use two male and two female speakers of the USC-TIMIT articulatory database, each uttering 460 sentences. We evaluate the results with objective measures (normalized MSE and mel-cepstral distortion, MCD) and a subjective perceptual test, and show that CNN-LSTM networks taking multiple images as input are preferred, achieving MCD scores between 2.8 and 4.5 dB. In the experiments, we find that the predictions for speaker 'm1' are significantly weaker than for the other speakers, and we show that this is because 74% of the recordings of speaker 'm1' are out of sync.
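To make the CNN-LSTM setup concrete, below is a minimal PyTorch sketch of the general idea, not the authors' exact architecture: a small 2D CNN encodes each rtMRI frame, an LSTM models the frame sequence, and a linear head regresses vocoder spectral parameters per frame. The 68x68 frame size, the number of target coefficients (25), the layer sizes, and all names are illustrative assumptions.

```python
# Hypothetical sketch of an rtMRI-to-spectral-feature CNN-LSTM.
# All hyperparameters (frame size, channel counts, 25 targets) are assumptions.
import torch
import torch.nn as nn

class CnnLstmArt2Speech(nn.Module):
    def __init__(self, n_targets=25, hidden=256):
        super().__init__()
        # Per-frame 2D CNN encoder; single-channel 68x68 rtMRI frames assumed.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> (batch*time, 64)
        )
        # LSTM over the sequence of frame embeddings, so the prediction for
        # one frame can use context from the preceding images.
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)          # spectral parameters

    def forward(self, frames):                 # frames: (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))            # (b*t, 64)
        seq, _ = self.lstm(feats.view(b, t, -1))          # (b, t, hidden)
        return self.head(seq)                             # (b, t, n_targets)

model = CnnLstmArt2Speech()
dummy = torch.randn(2, 10, 1, 68, 68)  # 2 utterances, 10 frames each
print(model(dummy).shape)              # torch.Size([2, 10, 25])
```

In such a pipeline the predicted spectral parameters would then be passed to a vocoder (e.g. an MLSA-type filter) to synthesize the waveform; training would typically minimize MSE against features extracted from the parallel audio.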
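For the objective evaluation, mel-cepstral distortion is a standard measure; a minimal sketch of the usual formulation is below, assuming the reference and synthesized mel-cepstra are already time-aligned frame by frame and the 0th (energy) coefficient is excluded. The function name and input layout are illustrative.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Mean frame-wise MCD in dB between two (frames, coeffs) arrays of
    mel-cepstra (0th coefficient excluded, frames time-aligned)."""
    diff = mc_ref - mc_syn
    # Standard MCD: (10 / ln 10) * sqrt(2 * sum_d (mc_d - mc'_d)^2) per frame
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values indicate closer spectra; the 2.8-4.5 dB range reported in the abstract would be computed with a measure of this kind over all test frames.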
