Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework

Real-time machine speech interpretation aims to mimic human interpreters, who are able to produce high-quality speech translations on the fly. This requires all system components, including speech recognition, machine translation, and text-to-speech (TTS), to operate incrementally, before the speaker has finished an entire sentence. For TTS, this poses a problem because a standard framework typically requires language-dependent contextual linguistic features of a full sentence to produce a natural-sounding speech waveform. Existing studies of incremental TTS (iTTS) have mainly been conducted on models based on hidden Markov models (HMMs). Recently, however, end-to-end neural TTS has been shown to synthesize more natural speech than HMM-based systems. In this paper, we take an initial step toward constructing iTTS on an end-to-end neural framework (Neural iTTS) and investigate the effects of various incremental units on the quality of end-to-end neural speech synthesis in both English and Japanese.
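As a rough illustration of what "incremental unit" means in this setting, the sketch below (helper names are hypothetical, not from the paper) segments an input sentence into word-level or phrase-level chunks that an incremental synthesizer would consume one at a time, instead of waiting for the full sentence:

```python
def segment(text: str, unit: str = "word", phrase_len: int = 3) -> list[str]:
    """Split text into incremental synthesis units.

    unit='word'     -> one word at a time (lowest latency)
    unit='phrase'   -> fixed-size chunks of phrase_len words
    unit='sentence' -> the full sentence (non-incremental baseline)
    """
    words = text.split()
    if unit == "word":
        return words
    if unit == "phrase":
        return [" ".join(words[i:i + phrase_len])
                for i in range(0, len(words), phrase_len)]
    return [text]


# An iTTS front end would feed each unit to the synthesizer as soon
# as it becomes available, trading latency against the contextual
# information the model sees:
for chunk in segment("this is an incremental unit demo", unit="phrase"):
    pass  # synthesize(chunk)  # hypothetical synthesis call
```

The trade-off the paper studies follows directly from this choice: shorter units reduce latency but deprive the end-to-end model of sentence-level context, which can degrade naturalness.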
