Paper information - End-to-end Speech Synthesis for Tibetan Lhasa Dialect

End-to-end Speech Synthesis for Tibetan Lhasa Dialect

Speech synthesis for the Tibetan Lhasa dialect is implemented on the basis of Tacotron, a novel end-to-end speech synthesis framework. The training transcripts use the phoneme list transcribed from Tibetan characters, and feature parameters are extracted from the mel-spectrogram. The model is then trained on the mapping from characters to spectrum. Tibetan is an important minority language in China, but little research on it exists at present. The experimental results were compared with traditional speech synthesis methods: the synthesized audio is significantly better than that of the traditional GMM-HMM approach in both naturalness and rhythm. This work provides an important reference for later research on Tibetan and promotes the development of Tibetan language research.
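The data preparation summarized in the abstract (phoneme transcripts on the input side, mel-spectrogram features on the target side) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' setup: the phoneme symbols are placeholders and not the Tibetan phoneme inventory, the mel conversion uses the common HTK-style formula, and the values of 80 mel channels, a 1024-point FFT, and a 16 kHz sample rate are illustrative only. The helper names `build_vocab`, `encode_phonemes`, and `mel_filterbank` are hypothetical.

```python
import math


def build_vocab(transcripts):
    """Map each phoneme symbol to an integer ID (sorted for determinism)."""
    symbols = sorted({p for utt in transcripts for p in utt})
    return {p: i for i, p in enumerate(symbols)}


def encode_phonemes(phonemes, vocab):
    """Turn a phoneme sequence into the ID sequence fed to the encoder."""
    return [vocab[p] for p in phonemes]


def hz_to_mel(hz):
    """HTK-style Hz-to-mel conversion (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters mapping n_fft // 2 + 1 linear bins to n_mels bands."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 band edges, equally spaced on the mel scale from 0 Hz to Nyquist.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising slope
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def apply_filterbank(bank, power_spectrum):
    """Project one frame's power spectrum onto the mel bands."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


# Placeholder phoneme symbols, not the paper's Tibetan phoneme list.
vocab = build_vocab([["sil", "k", "a", "sil"]])
ids = encode_phonemes(["k", "a"], vocab)

# One mel frame from a flat dummy power spectrum (illustrative sizes).
bank = mel_filterbank(n_mels=80, n_fft=1024, sample_rate=16000)
mel_frame = apply_filterbank(bank, [1.0] * (1024 // 2 + 1))
```

In a full pipeline the power spectrum would come from a short-time Fourier transform of each recording, and the resulting sequence of mel frames would be paired with the phoneme ID sequence as one training example.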

Guanyu Li | Lisai Luo | Chunwei Gong | Hailan Ding
