Enhancing Monotonicity for Robust Autoregressive Transformer TTS

With the development of sequence-to-sequence modeling, Text-to-Speech (TTS) systems have achieved significant improvements in speech quality and naturalness. Deep learning models such as recurrent neural networks (RNNs) and their memory-enhanced variants have shown a strong ability to map input linguistic features to acoustic features. However, their efficiency is limited by sequential processing in both training and inference. Recently, the Transformer, with its superior parallelism, has been applied to TTS: it replaces the recurrent mechanism with positional embeddings for position modeling and significantly speeds up training. However, its attention lacks a monotonic constraint and suffers from issues such as pronunciation skipping. In this paper, we therefore propose a monotonicity-enhancing approach that combines Stepwise Monotonic Attention (SMA) with multi-head attention in a Transformer-based TTS system. Experiments show that the proposed approach reduces bad cases from 53 out of 500 sentences to 1, while improving the MOS from 4.09 to 4.17 in a naturalness test.
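
To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation, of the expected-alignment recurrence that Stepwise Monotonic Attention uses during training: at each decoder step the attention may only stay at the current encoder index j or advance to j+1, so soft alignments can be accumulated as alpha[i, j] = alpha[i-1, j] * p[i, j] + alpha[i-1, j-1] * (1 - p[i, j-1]), where p = sigmoid(energy) is the probability of staying in place. The energy function and the pre-sigmoid noise used to encourage discreteness are omitted, and the exact indexing convention of the stay probability is an assumption here.

    import torch

    def sma_expected_alignment(energies: torch.Tensor) -> torch.Tensor:
        """Compute soft SMA alignments from attention energies.

        energies: (T_dec, T_enc) attention logits. Function and variable
        names are illustrative, not taken from the paper's code.
        """
        T_dec, T_enc = energies.shape
        p_stay = torch.sigmoid(energies)       # probability of staying at index j
        alpha = energies.new_zeros(T_dec, T_enc)
        alpha[0, 0] = 1.0                      # attention starts on the first token
        for i in range(1, T_dec):
            stay = alpha[i - 1] * p_stay[i]    # mass that remains at index j
            move = energies.new_zeros(T_enc)   # mass that advances from j-1 to j
            move[1:] = alpha[i - 1, :-1] * (1.0 - p_stay[i, :-1])
            alpha[i] = stay + move
        return alpha

In a Transformer-based system, this recurrence would presumably replace the softmax normalization in the encoder-decoder attention, applied per head; at inference the sigmoid can be discretized so that the alignment advances strictly one encoder position at a time.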
