ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

This paper proposes a voice conversion method based on fully convolutional sequence-to-sequence (seq2seq) learning. The proposed method, which we call “ConvS2S-VC”, learns the mapping between source and target speech feature sequences using a fully convolutional seq2seq model with an attention mechanism. Owing to the nature of seq2seq learning, the method is notable in that it can flexibly convert not only the voice characteristics but also the pitch contour and duration of the input speech. The model consists of six networks, namely a source encoder, a target encoder, a target decoder, a source reconstructor, a target reconstructor, and a postnet, each designed using dilated causal convolution networks with gated linear units. Subjective evaluation experiments showed that the proposed method achieves higher sound quality and speaker similarity than a baseline method.
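The building block named above, a dilated causal convolution gated by a gated linear unit (GLU), can be sketched as follows. This is a minimal single-channel illustration of the two operations, not the authors' actual multi-channel architecture; the function names, the zero-padding convention, and the single-channel setting are assumptions made for illustration.

```python
import math

def dilated_causal_conv(x, w, dilation):
    """1-D dilated causal convolution: y[t] = sum_k w[k] * x[t - k*dilation].

    Causal means y[t] depends only on x[0..t]; positions before the start
    of the sequence are treated as zeros (left zero-padding).
    """
    y = []
    for t in range(len(x)):
        s = 0.0
        for k in range(len(w)):
            idx = t - k * dilation
            if idx >= 0:  # taps that would look before t=0 contribute nothing
                s += w[k] * x[idx]
        y.append(s)
    return y

def glu(a, b):
    """Gated linear unit: elementwise a * sigmoid(b).

    In a gated conv layer, `a` and `b` are the outputs of two parallel
    convolutions over the same input; sigmoid(b) acts as a learned gate.
    """
    return [ai * (1.0 / (1.0 + math.exp(-bi))) for ai, bi in zip(a, b)]

def gated_conv_block(x, w_lin, w_gate, dilation):
    """One gated convolutional block: two parallel dilated causal convs,
    combined by a GLU (hypothetical single-channel sketch)."""
    return glu(dilated_causal_conv(x, w_lin, dilation),
               dilated_causal_conv(x, w_gate, dilation))
```

Stacking such blocks with growing dilation (1, 2, 4, ...) widens the receptive field exponentially while keeping the output strictly causal, which is what makes this family of convolutions usable in place of recurrent layers for sequence modeling.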
