Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

This work applies hierarchical transfer learning to build deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require large amounts of training data, and while DNN-based TTS has achieved remarkable results for high-resource languages in recent years, it still suffers from data scarcity for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages, making use of a high-resource language and a joint multilingual dataset of low-resource languages. A monolingual TTS pre-trained on the high-resource language is first fine-tuned on a low-resource language using the same model architecture. We then apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on Indonesian, Javanese, and Sundanese show adequate quality of synthesized speech. The multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), 4.20 for Javanese (ground truth = 4.38), and 4.28 for Sundanese (ground truth = 4.20). For parallel style transfer, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. These results indicate that the proposed strategy can be effectively applied to the low-resource target domain. With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching a real human voice, and successfully transfer speaking style from a reference audio.
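The partial network-based transfer described above can be sketched as copying every pretrained parameter whose name and shape match into the larger next-stage model, while newly added components (e.g. a language or style embedding) keep their fresh initialization. The snippet below is a minimal, framework-agnostic illustration using plain NumPy arrays; the layer names (`encoder.emb`, `lang.emb`, `decoder.out`) and shapes are hypothetical, not the authors' actual architecture.

```python
import numpy as np

def partial_transfer(src_params, dst_params):
    """Copy weights from a pretrained (smaller) network into a larger one.

    Parameters whose names match and whose shapes agree are transferred;
    everything else keeps its fresh initialization.
    Returns the list of transferred parameter names.
    """
    transferred = []
    for name, w in dst_params.items():
        if name in src_params and src_params[name].shape == w.shape:
            dst_params[name] = src_params[name].copy()
            transferred.append(name)
    return transferred

# Stage 1: monolingual TTS (toy shapes, pretrained weights = ones)
mono = {"encoder.emb": np.ones((10, 4)),
        "decoder.out": np.ones((4, 2))}

# Stage 2: multilingual TTS adds a language embedding (fresh init = zeros)
multi = {"encoder.emb": np.zeros((10, 4)),
         "lang.emb": np.zeros((3, 4)),
         "decoder.out": np.zeros((4, 2))}

moved = partial_transfer(mono, multi)
# "encoder.emb" and "decoder.out" are initialized from the pretrained
# model; "lang.emb" is not present in the source and stays as initialized.
```

In a real training pipeline the same idea is typically expressed via a framework's checkpoint loader with non-strict matching, and the stage-2 model is then fine-tuned end to end on the multilingual data.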
