Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech

We present multi-speaker text-to-speech corpora for Javanese and Sundanese, the second and third largest languages of Indonesia, spoken by well over a hundred million people. The key objectives were to collect high-quality data affordably and to share the data publicly with the speech community. To achieve this, we collaborated with two local universities in Java and streamlined our recording and crowdsourcing processes to produce corpora consisting of 5,800 (Javanese) and 4,200 (Sundanese) mixed-gender recordings. We used these corpora to build several configurations of multi-speaker neural network-based text-to-speech systems for Javanese and Sundanese. Subjective evaluations of these configurations show that the multilingual ones, in which Javanese and Sundanese are trained jointly with a larger corpus of Standard Indonesian, significantly outperform systems built from a single language. We hope that sharing these corpora publicly and presenting our multilingual approach to text-to-speech will help the community scale up text-to-speech technologies to other lesser-resourced languages of Indonesia.
