Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces

Silent Speech Interface systems follow two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis approach seeks to convert the articulatory information directly to speech synthesis (vocoder) parameters. In both cases, deep neural networks are an obvious and popular choice for learning the mapping. Recognizing that learning speech recognition targets and speech synthesis targets (acoustic model states vs. vocoder parameters) from the same ultrasound tongue image input are two closely related tasks, here we experiment with the multi-task training of deep neural networks, which seeks to solve the two tasks simultaneously. Our results show that the parallel learning of the two types of targets is indeed beneficial for both tasks. Moreover, we obtained further improvements by using multi-task training as a weight initialization step before task-specific training. Overall, we report a relative error rate reduction of about 7% in both the speech recognition and the speech synthesis tasks.
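
To make the multi-task setup concrete, the sketch below shows one way such a network could be organized: a shared trunk over a flattened ultrasound tongue image, with a classification head for acoustic-model states and a regression head for vocoder parameters, trained on a weighted sum of the two losses. This is a minimal illustrative sketch in PyTorch, not the paper's implementation; the layer sizes, the number of states (1000), the number of vocoder parameters (25), and the task-weighting factor alpha are all assumed values.

```python
# Minimal multi-task DNN sketch (assumed architecture and sizes, not the paper's exact setup).
import torch
import torch.nn as nn

class MultiTaskSSI(nn.Module):
    def __init__(self, input_dim=64 * 64, num_states=1000, num_vocoder_params=25):
        super().__init__()
        # Shared hidden layers learn a representation used by both tasks.
        self.shared = nn.Sequential(
            nn.Linear(input_dim, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
        )
        # Task-specific output heads.
        self.state_head = nn.Linear(1000, num_states)            # speech recognition targets (classification)
        self.vocoder_head = nn.Linear(1000, num_vocoder_params)  # speech synthesis targets (regression)

    def forward(self, x):
        h = self.shared(x)
        return self.state_head(h), self.vocoder_head(h)

model = MultiTaskSSI()
ce_loss, mse_loss = nn.CrossEntropyLoss(), nn.MSELoss()
alpha = 0.5  # assumed weighting between the two task losses

# One training step on a dummy mini-batch of 8 ultrasound frames.
images = torch.randn(8, 64 * 64)
state_targets = torch.randint(0, 1000, (8,))
vocoder_targets = torch.randn(8, 25)

state_logits, vocoder_pred = model(images)
loss = alpha * ce_loss(state_logits, state_targets) + (1 - alpha) * mse_loss(vocoder_pred, vocoder_targets)
loss.backward()
```

In the same spirit, the shared trunk trained this way could then serve as a weight initialization before continuing with only one task-specific head, which is the fine-tuning variant the abstract refers to.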
