Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping

Within speech processing, articulatory-to-acoustic mapping (AAM) methods can use ultrasound tongue imaging (UTI) as their input. (Micro)convex transducers are most commonly used, and they provide a wedge-shaped visual image. However, this representation is optimized for visual inspection by the human eye, and the signal is often post-processed internally by the equipment. With newer ultrasound devices, it is now possible to access the raw scanline data (i.e., the ultrasound echo return) without any internal post-processing. In this study, we compared the raw scanline representation with the wedge-shaped processed UTI as input to a residual network applied for AAM, and we also investigated the optimal size of the input image. We found no significant difference between the performance attained using the raw data and that attained using the wedge-shaped image extrapolated from it. The optimal pixel size was 64 × 43 for the raw scanline input and 64 × 64 when transformed to a wedge. It is therefore not necessary to use the full original 64 × 842 pixel raw scanline; a smaller image is sufficient. This allows smaller networks to be built, and will benefit the development of session- and speaker-independent methods for practical applications. The target application of AAM systems is a “silent speech interface”, which could aid the communication of the speech-impaired, and could be useful in military applications or in extremely noisy conditions.
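The wedge-shaped image is obtained from the raw scanlines by an ordinary polar-to-Cartesian scan conversion. The sketch below illustrates the idea; the field of view, the `zero_offset` (distance from the virtual apex to the first echo sample), and the output size are illustrative assumptions, not values reported in the paper.

```python
# A minimal scan-conversion sketch: map a raw (n_scanlines x n_echoes)
# echo-return frame, e.g. 64 x 842, onto a wedge-shaped Cartesian image.
# fov_deg, zero_offset and out_size are assumed, illustrative values.
import numpy as np
from scipy.ndimage import map_coordinates

def scanlines_to_wedge(frame, fov_deg=90.0, zero_offset=100.0,
                       out_size=(64, 64)):
    """frame: (n_scanlines, n_echoes) raw echo-return array."""
    n_lines, n_echoes = frame.shape
    h, w = out_size
    r_max = zero_offset + n_echoes          # outer radius, in echo samples

    # Cartesian grid over the wedge; the probe apex sits at the bottom centre.
    xs = np.linspace(-r_max, r_max, w)
    ys = np.linspace(r_max, 0.0, h)
    xg, yg = np.meshgrid(xs, ys)

    # Back-map each Cartesian pixel onto (scanline index, echo index).
    radius = np.hypot(xg, yg)
    theta = np.arctan2(xg, yg)              # 0 rad = centre beam
    fov = np.deg2rad(fov_deg)
    line_idx = (theta / fov + 0.5) * (n_lines - 1)
    echo_idx = radius - zero_offset

    # Bilinear interpolation; pixels outside the sector get the fill value.
    return map_coordinates(frame, [line_idx, echo_idx], order=1, cval=0.0)
```

For example, `scanlines_to_wedge(raw_frame)` with a 64 × 842 raw frame yields a 64 × 64 wedge image of the kind compared against the raw representation above.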
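As a rough illustration of the residual-network regressor described above, the following Keras sketch maps a single 64 × 43 raw-scanline frame to one frame of vocoder spectral parameters. The block layout, filter counts, and the 25-dimensional target are assumptions made for illustration; the architecture used in the paper may differ.

```python
# A minimal sketch of a residual CNN for articulatory-to-acoustic mapping:
# one UTI frame in (64 x 43, the size found optimal for raw scanlines),
# one vector of vocoder spectral parameters out. All sizes are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def residual_block(x, filters):
    """Two 3x3 convolutions with an identity (or 1x1-projected) skip path."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def build_resnet(input_shape=(64, 43, 1), n_targets=25):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    for filters in (32, 64, 64):
        x = residual_block(x, filters)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(n_targets, activation="linear")(x)  # spectral frame
    return models.Model(inp, out)

model = build_resnet()
model.compile(optimizer="adam", loss="mse")  # regression to vocoder params
```

A smaller input such as 64 × 43 keeps the flattened feature map, and hence the dense layers, far smaller than the full 64 × 842 scanline would, which is the practical benefit of the input-size result reported above.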
