ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration-allocated Tacotron-like acoustic models and WaveRNN neural vocoders. Unlike conventional SVS models, ByteSing employs Tacotron-like encoder-decoder structures as its acoustic models, with CBHG modules as encoders and recurrent neural networks (RNNs) as decoders. An auxiliary phoneme duration prediction model expands the input sequence, which improves controllability, model stability, and tempo prediction accuracy. WaveRNN neural vocoders are adopted to further improve the quality of the synthesized songs. Both objective and subjective experimental results show that the proposed SVS method produces natural, expressive, and high-fidelity songs by improving pitch and spectrogram prediction accuracy, and that the models using an attention mechanism achieve the best performance.
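The duration-allocation step described above can be illustrated with a minimal sketch: each phoneme-level encoder vector is repeated for its predicted number of acoustic frames, producing the frame-level sequence the decoder consumes. Function and variable names here are illustrative, not taken from the paper.

```python
import numpy as np

def expand_by_duration(encoder_outputs, durations):
    """Repeat each phoneme-level encoder vector `d` times, where `d`
    is the frame count predicted by an auxiliary duration model.
    Returns a frame-level matrix of shape (sum(durations), dim)."""
    frames = [np.tile(vec, (int(d), 1))
              for vec, d in zip(encoder_outputs, durations)]
    return np.concatenate(frames, axis=0)

# Toy example: 3 phonemes with 4-dim encodings, durations 2/3/1 frames.
enc = np.arange(12, dtype=float).reshape(3, 4)
out = expand_by_duration(enc, [2, 3, 1])
print(out.shape)  # (6, 4)
```

Because the decoder then operates on a sequence whose length already matches the target frame count, alignment between phonemes and acoustic frames is fixed in advance rather than learned implicitly, which is the source of the stability and tempo-control benefits the abstract claims.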
