Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer

We propose a sequence-to-sequence singing synthesizer that avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the common combination of a content-based attention mechanism with an autoregressive decoder, we use an alignment mechanism suited to feed-forward synthesis. Because phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. A decoder based on a feed-forward variant of the Transformer model, built from stacked self-attention and convolutional layers, then refines this initial alignment to predict the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model against an autoregressive baseline, the importance of self-attention, and the sensitivity of the system to the accuracy of the duration model.
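The pipeline the abstract describes can be sketched as follows. This is a minimal, single-head NumPy illustration, not the paper's actual implementation: phoneme encodings are repeated according to durations from a simple duration model to form the initial frame-level alignment, and a feed-forward Transformer block (self-attention plus a convolutional position-wise network, each with residual connections and layer normalization) refines the result. All function names, dimensions, and weight shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def expand_by_duration(phoneme_enc, durations):
    """Repeat each phoneme encoding for its predicted number of frames,
    yielding the approximate initial alignment (length-regulator style)."""
    return np.repeat(phoneme_enc, durations, axis=0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def conv1d_same(x, w):
    """1-D convolution over time with 'same' padding;
    w has shape (kernel_size, d_in, d_out)."""
    ksize = w.shape[0]
    pad = ksize // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        sum(xp[t + j] @ w[j] for j in range(ksize))
        for t in range(x.shape[0])
    ])

def fft_block(x, w_q, w_k, w_v, w_c1, w_c2):
    """One feed-forward Transformer block: self-attention and a
    convolutional position-wise network, each followed by a
    residual connection and layer normalization."""
    x = layer_norm(x + self_attention(x, w_q, w_k, w_v))
    h = np.maximum(conv1d_same(x, w_c1), 0.0)  # ReLU
    return layer_norm(x + conv1d_same(h, w_c2))

# Toy usage: three phonemes with durations of 2, 4, and 3 frames.
d = 8
phonemes = rng.standard_normal((3, d))
frames = expand_by_duration(phonemes, np.array([2, 4, 3]))  # shape (9, d)
out = fft_block(
    frames,
    rng.standard_normal((d, d)), rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal((3, d, d)), rng.standard_normal((3, d, d)),
)
print(out.shape)  # (9, 8): one refined feature vector per frame
```

In practice the refinement stacks several such blocks, but the key property shown here is that the whole computation is feed-forward over all frames at once, with no step-by-step autoregressive loop at inference time.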
