Integration of Accent Sandhi and Prosodic Features Estimation for Japanese Text-to-Speech Synthesis

Japanese text-to-speech (TTS) synthesis has been actively researched in recent years. Generating high-quality synthetic speech requires estimating appropriate prosodic information. However, manual annotation is costly, and automatic annotation introduces estimation errors. To overcome these problems, this paper examines integrating accent sandhi and prosodic feature estimation into the acoustic model for Japanese TTS. The proposed method optimizes the F0 model as a whole, using linguistic features obtained from a dictionary. Objective and subjective evaluations confirmed that the method reduces the cost of creating accent labels and improves the accuracy of prosodic feature estimation.
