Lhasa-Tibetan Speech Synthesis Using an End-to-End Model

With the development of deep learning, speech synthesis based on deep neural networks has gradually become the mainstream approach in the field. In this paper, we explore the Tacotron2 model for Lhasa-Tibetan dialect speech synthesis: we construct a feature prediction network with a seq2seq structure that maps character vectors to a mel spectrogram, and combine it with a WaveNet model, trained in a semi-supervised way, that synthesizes the mel spectrogram into a time-domain waveform. This design avoids the front-end text analysis that requires extensive prior linguistic knowledge of the Lhasa-Tibetan dialect, and it reduces the need for large amounts of transcribed speech data. Experimental results show that the proposed method is effective and achieves higher clarity and naturalness than other related synthesis models for the Lhasa-Tibetan dialect.
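To make the two-stage pipeline concrete, the sketch below gives a minimal PyTorch rendering of the idea. It is not the paper's implementation: every class name, layer choice, and hyperparameter (vocabulary size 80, 80 mel bins, hidden size 256, and so on) is a placeholder assumption. The attention decoder is collapsed to one learned query per output frame, and the vocoder emits one value per mel frame instead of running autoregressively at audio sample rate.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePredictionNet(nn.Module):
    """Tacotron2-style seq2seq network, heavily simplified:
    character IDs in, mel-spectrogram frames out."""
    def __init__(self, vocab_size=80, emb_dim=256, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # character vectors
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)          # decoder states -> mel bins

    def forward(self, char_ids, n_frames):
        memory, _ = self.encoder(self.embed(char_ids))   # (B, T_text, 2H)
        # One learned query per output frame: a stand-in for the autoregressive
        # location-sensitive attention decoder of the real Tacotron2.
        query = memory.mean(dim=1, keepdim=True).repeat(1, n_frames, 1)
        context, _ = self.attn(query, memory, memory)
        out, _ = self.decoder(context)
        return self.to_mel(out)                          # (B, n_frames, n_mels)

class WaveNetVocoder(nn.Module):
    """Toy stack of dilated causal convolutions conditioned on mel frames.
    A real WaveNet generates autoregressively at audio sample rate; here we
    emit one value per mel frame just to show the conditioning path."""
    def __init__(self, n_mels=80, channels=64, n_layers=6):
        super().__init__()
        self.cond = nn.Linear(n_mels, channels)
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers))
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, mel):
        x = self.cond(mel).transpose(1, 2)               # (B, C, n_frames)
        for conv in self.layers:
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            x = torch.tanh(conv(F.pad(x, (pad, 0))))     # left-pad keeps it causal
        return self.out(x).squeeze(1)                    # (B, n_frames)

# Toy end-to-end pass: 12 character IDs in, 50 mel frames, 50 output values.
chars = torch.randint(0, 80, (1, 12))
mel = FeaturePredictionNet()(chars, n_frames=50)
wave = WaveNetVocoder()(mel)
print(mel.shape, wave.shape)  # torch.Size([1, 50, 80]) torch.Size([1, 50])

The point of the two-stage split is visible in the shapes: the feature prediction network absorbs all text-side modeling, so the vocoder only ever sees mel spectrograms, which is what allows it to be trained separately, including in a semi-supervised way as the paper describes.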
