Integrating Articulatory Information in Deep Learning-Based Text-to-Speech Synthesis

Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated into deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were jointly the output of a deep neural network (DNN), and (2) direct integration plus forward mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the directly predicted acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgments by human listeners showed that the approaches integrating articulatory information outperformed the baseline approach (without articulatory information) in terms of naturalness and speaker voice identity (voice similarity).
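
The two integration schemes described above can be illustrated with a minimal sketch in PyTorch. This is not the authors' implementation: the feature dimensions, hidden-layer sizes, activation functions, and the combination weight alpha are assumptions introduced here for illustration; the source specifies only that a joint DNN predicts acoustic and articulatory features and that a second DNN forward-maps articulatory features to acoustic features before combination.

```python
import torch
import torch.nn as nn

class JointDNN(nn.Module):
    """Direct integration: one DNN predicts acoustic and articulatory
    features jointly from linguistic input features.
    Dimensions and layer sizes are hypothetical."""
    def __init__(self, ling_dim=420, acoustic_dim=187, artic_dim=18, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(ling_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.acoustic_head = nn.Linear(hidden, acoustic_dim)
        self.artic_head = nn.Linear(hidden, artic_dim)

    def forward(self, ling):
        h = self.trunk(ling)
        return self.acoustic_head(h), self.artic_head(h)

class ForwardMapDNN(nn.Module):
    """Forward mapping: an additional DNN maps the predicted articulatory
    features to acoustic features."""
    def __init__(self, artic_dim=18, acoustic_dim=187, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(artic_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, artic):
        return self.net(artic)

def synthesize_acoustic(ling, joint_dnn, fwd_dnn, alpha=0.5):
    """Scheme (2): combine directly predicted and forward-mapped acoustic
    features. The combination weight alpha is an assumption, not a value
    given in the paper."""
    acoustic_direct, artic = joint_dnn(ling)
    acoustic_mapped = fwd_dnn(artic)
    return alpha * acoustic_direct + (1 - alpha) * acoustic_mapped
```

Scheme (1) corresponds to using only the acoustic output of JointDNN at synthesis time; scheme (2) additionally passes the predicted articulatory trajectory through ForwardMapDNN and blends the two acoustic predictions.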

[1]  Frank K. Soong,et al.  On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Ashok Samal,et al.  An Optimal Set of Flesh Points on Tongue and Lips for Speech-Movement Classification. , 2016, Journal of speech, language, and hearing research : JSLHR.

[3]  Heiga Zen,et al.  Statistical parametric speech synthesis using deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Simon King,et al.  The voice bank corpus: Design, collection and data analysis of a large regional accent speech database , 2013, 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE).

[5]  An Ji,et al.  Speaker independent acoustic-to-articulatory inversion , 2014 .

[6]  Simon King,et al.  Towards Personalised Synthesised Voices for Individuals with Vocal Disabilities: Voice Banking and Reconstruction , 2013, SLPAT.

[7]  Alex Acero,et al.  Recent improvements on Microsoft's trainable text-to-speech system-Whistler , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  Laurent Girin,et al.  Robust articulatory speech synthesis using deep neural networks for BCI applications , 2014, INTERSPEECH.

[9]  Yu-Tsai Wang,et al.  Tongue-surface movement patterns during speech and swallowing. , 2003, The Journal of the Acoustical Society of America.

[10]  Brad H. Story,et al.  An acoustically-driven vocal tract model for stop consonant production , 2017, Speech Commun..

[11]  Petr Motlícek,et al.  Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser Based on DNN , 2016, INTERSPEECH.

[12]  Ren-Hua Wang,et al.  Integrating Articulatory Features Into HMM-Based Parametric Speech Synthesis , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Stuart Cunningham,et al.  Reconstructing the Voice of an Individual Following Laryngectomy , 2011, Augmentative and alternative communication.

[14]  Amarendar Reddy,et al.  Communication of Dumb & Blind People with T -T-S , 2012 .

[15]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[16]  Jeffrey J Berry,et al.  Accuracy of the NDI wave speech research system. , 2011, Journal of speech, language, and hearing research : JSLHR.

[17]  D. Beukelman,et al.  Augmentative & Alternative Communication: Supporting Children & Adults With Complex Communication Needs , 2006 .

[18]  S. King,et al.  Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction , 2012 .

[19]  Jun Wang,et al.  Parkinson's condition estimation using speech acoustic and inversely mapped articulatory data , 2015, INTERSPEECH.

[20]  Jun Wang,et al.  Determining an Optimal Set of Flesh Points on Tongue, Lips, and Jaw for Continuous Silent Speech Recognition , 2015, SLPAT@Interspeech.

[21]  Jun Wang,et al.  Speaker-independent silent speech recognition with across-speaker articulatory normalization and speaker adaptive training , 2015, INTERSPEECH.

[22]  An Ji,et al.  Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[23]  Heiga Zen,et al.  The HMM-based speech synthesis system (HTS) version 2.0 , 2007, SSW.

[24]  Zhizheng Wu,et al.  Merlin: An Open Source Neural Network Speech Synthesis System , 2016, SSW.

[25]  Moncef Gabbouj,et al.  Prediction of Voice Aperiodicity Based on Spectral Representations in HMM Speech Synthesis , 2011, INTERSPEECH.

[26]  Frank K. Soong,et al.  TTS synthesis with bidirectional LSTM based recurrent neural networks , 2014, INTERSPEECH.

[27]  Thomas Portele,et al.  Adapting a TTS system to a reading machine for the blind , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[28]  Peter Birkholz,et al.  Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis , 2017, Comput. Speech Lang..

[29]  John Goddard Close,et al.  Speech Synthesis Based on Hidden Markov Models and Deep Learning , 2016, Res. Comput. Sci..

[30]  Masanori Morise,et al.  D4C, a band-aperiodicity estimator for high-quality speech synthesis , 2016, Speech Commun..

[31]  J. M. Gilbert,et al.  Silent speech interfaces , 2010, Speech Commun..

[32]  Junichi Yamagishi,et al.  An Introduction to HMM-Based Speech Synthesis , 2006 .

[33]  Ashok Samal,et al.  Articulatory distinctiveness of vowels and consonants: a data-driven approach. , 2013, Journal of speech, language, and hearing research : JSLHR.

[34]  Heiga Zen,et al.  Autoregressive Models for Statistical Parametric Speech Synthesis , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[35]  Jun Wang,et al.  Preliminary Test of a Real-Time, Interactive Silent Speech Interface Based on Electromagnetic Articulograph , 2014, SLPAT@ACL.

[36]  Ricardo Gutierrez-Osuna,et al.  Data driven articulatory synthesis with deep neural networks , 2016, Comput. Speech Lang..