Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, with neural methods now capable of producing audio with high naturalness. However, these methods still suffer from two types of latency: (a) the computational latency (synthesis time), which grows linearly with the sentence length even with parallel approaches, and (b) the input latency in scenarios where the input text is generated incrementally (such as simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we devise the first neural incremental TTS approach based on the recently proposed prefix-to-prefix framework. We synthesize speech in an online fashion, playing one segment of audio while generating the next, resulting in $O(1)$ rather than $O(n)$ latency.
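
The $O(1)$ versus $O(n)$ claim can be illustrated with a toy cost model. The sketch below is not the paper's system or measurement: the function names, the fixed per-word synthesis cost, and the `lookahead` parameter are all assumptions made for illustration. It only shows why the time to first audio stays constant once playback can begin after a fixed-size first segment instead of after the whole sentence.

```python
def full_sentence_latency(n_words, cost_per_word_ms):
    """Time before any audio plays if the whole sentence is synthesized first.

    Assumes (hypothetically) a fixed synthesis cost per word, so the
    delay grows linearly with sentence length: O(n).
    """
    return n_words * cost_per_word_ms


def incremental_latency(n_words, cost_per_word_ms, lookahead=1):
    """Time before any audio plays if playback starts after the first segment.

    Assumes the first segment covers one word plus `lookahead` further
    words (a hypothetical fixed window), so the delay does not depend on
    the sentence length once the sentence exceeds the window: O(1).
    """
    first_segment_words = min(n_words, 1 + lookahead)
    return first_segment_words * cost_per_word_ms


if __name__ == "__main__":
    # Compare the two strategies for sentences of increasing length.
    for n in (5, 20, 80):
        full = full_sentence_latency(n, cost_per_word_ms=50)
        inc = incremental_latency(n, cost_per_word_ms=50, lookahead=1)
        print(f"{n:3d} words: full-sentence {full} ms, incremental {inc} ms")
```

Under these assumptions the full-sentence delay scales with the word count, while the incremental delay is fixed by the window size. Later segments are synthesized while earlier ones play, so they add no startup delay as long as synthesis keeps pace with playback.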
