Hierarchical Transfer Learning for Text-to-Speech in Indonesian, Javanese, and Sundanese Languages

This research develops end-to-end deep learning-based text-to-speech (TTS) for Indonesian, Javanese, and Sundanese. While end-to-end neural TTS, such as Tacotron-2, has made remarkable progress recently, it still suffers from data scarcity for low-resource languages such as Javanese and Sundanese. Our preliminary study shows that Tacotron-2-based TTS needs a large amount of training data: at least 10 hours are required for the model to synthesize intelligible speech of acceptable quality. To address this low-resource problem, we propose hierarchical transfer learning to train TTS models for Javanese and Sundanese, taking advantage of a dissimilar high-resource language (English) and a similar intermediate-resource language (Indonesian). The mean opinion score (MOS) of the synthesized speech reaches 4.27 for Indonesian, 4.08 for Javanese, and 3.92 for Sundanese. Word accuracy (WAcc) on semantically unpredictable sentences (SUS) reaches 98.26% for Indonesian, 95.02% for Javanese, and 95.43% for Sundanese. These subjective evaluations of synthetic speech quality demonstrate that our transfer learning scheme applies successfully to TTS models in low-resource target domains. Using less than one hour of training data per language (38 minutes for Indonesian, 16 minutes for Javanese, and 19 minutes for Sundanese), the TTS models learn quickly and achieve adequate performance.
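The hierarchical transfer-learning chain described above (pretrain on high-resource English, fine-tune on intermediate-resource Indonesian, then fine-tune on low-resource Javanese or Sundanese) can be sketched structurally as follows. This is a minimal toy illustration, not the authors' actual Tacotron-2 code: the `train` and `transfer` functions and the scalar "weights" are hypothetical stand-ins showing only how each stage is initialized from the previous stage's checkpoint.

```python
# Toy sketch of hierarchical transfer learning for TTS.
# Real training would optimize a Tacotron-2 model on (text, audio) pairs;
# here each domain is summarized by a single float, and "training" just
# nudges the weights toward that value.

def train(weights, data_hours, lr=0.1):
    """Hypothetical training step: move weights toward the dataset mean."""
    target = sum(data_hours) / len(data_hours)
    return {name: w + lr * (target - w) for name, w in weights.items()}

def transfer(source_weights):
    """Initialize the next stage's model from the previous checkpoint,
    rather than from scratch -- the core of the transfer scheme."""
    return dict(source_weights)

# Stage 1: pretrain on a large English corpus (high-resource).
english = train({"encoder": 0.0, "decoder": 0.0}, data_hours=[24.0])

# Stage 2: fine-tune on a small Indonesian set (~38 minutes in the paper).
indonesian = train(transfer(english), data_hours=[38 / 60])

# Stage 3: fine-tune separately on Javanese (~16 min) and Sundanese
# (~19 min), both starting from the Indonesian checkpoint.
javanese = train(transfer(indonesian), data_hours=[16 / 60])
sundanese = train(transfer(indonesian), data_hours=[19 / 60])
```

The key design point is that the low-resource stages never see random initialization: each inherits every parameter from the closest higher-resource stage, so only a small fine-tuning budget is needed.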
