JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

Thanks to improvements in machine learning techniques, including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies plays an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," aimed at end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcriptions, and covers all of the main pronunciations of daily-use Japanese characters. We describe how we designed and analyzed the corpus. The corpus is freely available online.
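A corpus of this kind is typically distributed as WAV files paired with plain-text transcript files. As a minimal sketch, assuming each transcript line has the form `UTT_ID:sentence` (the exact file layout and naming are assumptions, not guaranteed by the paper), the utterance-to-text mapping can be loaded like this:

```python
from pathlib import Path


def load_transcripts(transcript_path):
    """Parse a transcript file whose lines are assumed to look like
    'UTT_ID:sentence', returning a dict mapping utterance IDs to text.
    The corresponding audio is assumed to live at wav/<UTT_ID>.wav."""
    pairs = {}
    for line in Path(transcript_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split only on the first colon so the sentence may contain colons.
        utt_id, _, text = line.partition(":")
        pairs[utt_id] = text
    return pairs
```

Pairing each returned ID with its audio file then yields the (speech, transcription) examples an end-to-end model such as Tacotron or Char2Wav would train on.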
