Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework

Real-time machine speech interpretation aims to mimic human interpreters, who are able to produce high-quality speech translations on the fly. This requires all system components, including speech recognition, machine translation, and text-to-speech (TTS), to operate incrementally, before the speaker has finished an entire sentence. For TTS, this poses a problem because a standard framework typically requires language-dependent contextual linguistic features of a full sentence to produce a natural-sounding speech waveform. Existing studies of incremental TTS (iTTS) have mainly been conducted on models based on hidden Markov models (HMMs). Recently, however, end-to-end neural TTS has been shown to synthesize more natural speech than HMM-based systems. In this paper, we take an initial step toward constructing iTTS on an end-to-end neural framework (Neural iTTS) and investigate the effects of various incremental units on the quality of end-to-end neural speech synthesis in both English and Japanese.
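As a rough illustration of what "incremental unit" means in this setting, the sketch below (helper names are hypothetical, not from the paper) segments an input sentence into word-level or phrase-level chunks that an incremental synthesizer would consume one at a time, instead of waiting for the full sentence:

```python
def segment(text: str, unit: str = "word", phrase_len: int = 3) -> list[str]:
    """Split text into incremental synthesis units.

    unit='word'     -> one word at a time (lowest latency)
    unit='phrase'   -> fixed-size chunks of phrase_len words
    unit='sentence' -> the full sentence (non-incremental baseline)
    """
    words = text.split()
    if unit == "word":
        return words
    if unit == "phrase":
        return [" ".join(words[i:i + phrase_len])
                for i in range(0, len(words), phrase_len)]
    return [text]


# An iTTS front end would feed each unit to the synthesizer as soon
# as it becomes available, trading latency against the contextual
# information the model sees:
for chunk in segment("this is an incremental unit demo", unit="phrase"):
    pass  # synthesize(chunk)  # hypothetical synthesis call
```

The trade-off the paper studies follows directly from this choice: shorter units reduce latency but deprive the end-to-end model of sentence-level context, which can degrade naturalness.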
