Speaker-independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously unseen speakers. In this work, we investigate the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted. We observe that models underperform when applied to data from speakers not seen at training time. However, when provided with minimal additional speaker information, such as the mean ultrasound frame, the models generalize better to unseen speakers.
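To make the speaker-conditioning idea concrete, here is a minimal sketch (not the authors' actual model) of a CNN phonetic-segment classifier over raw ultrasound frames that can optionally take the speaker's mean ultrasound frame as a second input channel. The class name UTIClassifier, the network architecture, the 4 tongue-shape classes, and the 63 x 412 raw frame size are all illustrative assumptions.

import torch
import torch.nn as nn

class UTIClassifier(nn.Module):
    """Hypothetical raw-UTI classifier; speaker mean frame as extra channel."""

    def __init__(self, n_classes: int = 4, use_speaker_mean: bool = True):
        super().__init__()
        self.use_speaker_mean = use_speaker_mean
        in_channels = 2 if use_speaker_mean else 1
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size features regardless of frame size
        )
        self.classifier = nn.Linear(64 * 4 * 4, n_classes)

    def forward(self, frame, speaker_mean=None):
        # frame: (batch, 1, H, W) raw ultrasound; speaker_mean: (batch, 1, H, W)
        if self.use_speaker_mean:
            frame = torch.cat([frame, speaker_mean], dim=1)
        x = self.features(frame)
        return self.classifier(x.flatten(1))

# Usage sketch: the per-speaker mean frame is computed over that speaker's
# recordings and broadcast across the batch (random data for illustration).
frames = torch.randn(8, 1, 63, 412)
mean_frame = frames.mean(dim=0, keepdim=True).expand(8, -1, -1, -1)
model = UTIClassifier()
logits = model(frames, mean_frame)  # shape: (8, 4)

The design point is that the mean frame is a cheap, enrollment-free summary of a speaker's anatomy and probe placement, so concatenating it as an input channel lets the network normalize for speaker identity without any adaptation data beyond the recordings themselves.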
