Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform

We present a new workflow for creating components that extend the MaryTTS text-to-speech synthesis platform, which is popular with researchers and developers, to support new languages and custom synthetic voices. This workflow replaces the previous toolkit with an efficient, flexible process that leverages modern build automation and cloud-hosted infrastructure. Moreover, it is compatible with the updated MaryTTS architecture, enabling new features and state-of-the-art paradigms such as synthesis based on deep neural networks (DNNs). Like MaryTTS itself, the new tools are free and open-source software (FOSS) and promote the use of open data.
