MoBoAligner: A Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

To speed up inference in neural speech synthesis, non-autoregressive models have recently received increasing attention. Non-autoregressive models require the durations of text tokens as additional input to build a hard alignment between the encoder and the decoder. This duration-based alignment plays a crucial role: it controls the correspondence between text tokens and spectrogram frames and determines the rhythm and speed of the synthesized audio. To obtain better duration-based alignments and improve the quality of non-autoregressive speech synthesis, in this paper we propose a novel neural alignment model named MoBoAligner. Given paired text and mel spectrograms, MoBoAligner identifies the boundaries of text tokens within the mel-spectrogram frames based on token-frame similarity in a neural semantic space, within an end-to-end framework. From these boundaries, durations can be extracted and used to train non-autoregressive TTS models. Compared with durations extracted by TransformerTTS, MoBoAligner improves the non-autoregressive TTS model's MOS (3.74 versus FastSpeech's 3.44). Moreover, MoBoAligner is task-specific and lightweight, reducing the parameter count by 45% and the training time by 30%.
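To make the boundary-to-duration idea concrete, the following minimal sketch illustrates how per-token durations can be read off from a token-frame similarity matrix under a monotonic constraint. It is not MoBoAligner's actual algorithm (which is end-to-end and differentiable, operating in a learned semantic space); the dot-product similarity, the hard dynamic program, and the function name are illustrative assumptions.

```python
import numpy as np

def monotonic_boundary_durations(token_emb, frame_emb):
    """Toy sketch: assign each mel frame to one text token monotonically,
    then read off per-token durations from the resulting boundaries.

    token_emb: (T, d) text-token encodings
    frame_emb: (F, d) mel-frame encodings, F >= T
    Returns:   (T,) integer durations summing to F.
    """
    T, F = token_emb.shape[0], frame_emb.shape[0]
    assert F >= T, "need at least one frame per token"

    # Token-frame similarity; the paper uses a learned semantic space,
    # which this plain dot product does not model.
    sim = token_emb @ frame_emb.T  # (T, F)

    # dp[t, f]: best score aligning tokens 0..t to frames 0..f, with every
    # token receiving at least one frame and frame f assigned to token t.
    dp = np.full((T, F), -np.inf)
    dp[0, 0] = sim[0, 0]
    for f in range(1, F):
        dp[0, f] = dp[0, f - 1] + sim[0, f]
    for t in range(1, T):
        for f in range(t, F):
            stay = dp[t, f - 1]        # frame f extends token t
            advance = dp[t - 1, f - 1] # frame f starts token t
            dp[t, f] = max(stay, advance) + sim[t, f]

    # Backtrack the best monotonic path to recover token boundaries.
    durations = np.zeros(T, dtype=int)
    t, f = T - 1, F - 1
    while f >= 0:
        durations[t] += 1
        if t > 0 and (f == t or dp[t - 1, f - 1] >= dp[t, f - 1]):
            t -= 1  # boundary: token t started at frame f
        f -= 1
    return durations

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(3, 8))   # 3 text tokens
    frames = rng.normal(size=(10, 8))  # 10 mel frames
    d = monotonic_boundary_durations(tokens, frames)
    print(d, d.sum())  # durations sum to the number of frames
```

The monotonic constraint is what makes boundary search well posed: each frame belongs to exactly one token, tokens never reorder, and durations fall out as the gaps between consecutive boundaries.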

[1] Xu Tan, et al. FastSpeech: Fast, Robust and Controllable Text to Speech, 2019, NeurIPS.

[2] Ryan Prenger, et al. WaveGlow: A Flow-based Generative Network for Speech Synthesis, 2018, ICASSP 2019.

[3] Shinji Watanabe, et al. ESPnet: End-to-End Speech Processing Toolkit, 2018, INTERSPEECH.

[4] Navdeep Jaitly, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, 2017, ICASSP 2018.

[5] Heiga Zen, et al. Statistical Parametric Speech Synthesis, 2007, ICASSP 2007.

[6] Shujie Liu, et al. Neural Speech Synthesis with Transformer Network, 2018, AAAI.

[7] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.

[8] Colin Raffel, et al. Online and Linear-Time Attention by Enforcing Monotonic Alignments, 2017, ICML.

[9] Keiichi Tokuda, et al. Speech Parameter Generation Algorithms for HMM-based Speech Synthesis, 2000, ICASSP 2000.

[10] Yee Whye Teh, et al. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, 2016, ICLR.

[11] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[12] Heiga Zen, et al. Speech Synthesis Based on Hidden Markov Models, 2013, Proceedings of the IEEE.

[13] Ryuichi Yamamoto, et al. Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram, 2020, ICASSP 2020.

[14] Erich Elsen, et al. Efficient Neural Audio Synthesis, 2018, ICML.

[15] Tomoki Toda, et al. ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit, 2020, ICASSP 2020.

[16] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.

[17] Heiga Zen, et al. Parallel WaveNet: Fast High-Fidelity Speech Synthesis, 2017, ICML.

[18] Heiga Zen, et al. Statistical Parametric Speech Synthesis Using Deep Neural Networks, 2013, ICASSP 2013.