One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Recently, voice conversion (VC) without parallel data has been successfully extended to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model can only convert the voice to speakers seen in the training data, which narrows the applicable scenarios of VC. In this paper, we propose a novel one-shot VC approach that performs conversion given only one example utterance from the source speaker and one from the target speaker; neither speaker needs to be seen during training. This is achieved by disentangling speaker and content representations with instance normalization (IN). Objective and subjective evaluations show that our model generates voices similar to the target speaker. Beyond these performance measurements, we also demonstrate that the model learns meaningful speaker representations without any supervision.
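
As a rough illustration of the mechanism (a minimal sketch, not the authors' implementation), the code below shows plain instance normalization, which a content encoder can apply to strip per-channel utterance statistics that carry much of the speaker identity, and adaptive instance normalization (AdaIN), which lets a decoder re-inject statistics predicted from the target speaker's utterance. The function names, feature shapes, and the random gamma/beta values standing in for a speaker encoder's output are all hypothetical.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization over the time axis of a (channels, time)
    feature map: removes per-utterance channel statistics, which carry
    much of the speaker information."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + eps)

def adaptive_instance_norm(content, gamma, beta, eps=1e-5):
    """Adaptive IN: normalize the content features, then scale and shift
    them with statistics (gamma, beta) derived from the target speaker's
    utterance, injecting speaker identity into the decoder."""
    return gamma * instance_norm(content, eps) + beta

# Toy usage: 80-channel spectral features over 120 frames (hypothetical shapes).
rng = np.random.default_rng(0)
source_feats = rng.normal(size=(80, 120))
content = instance_norm(source_feats)   # speaker-agnostic content representation
gamma = rng.normal(size=(80, 1))        # would come from a speaker encoder
beta = rng.normal(size=(80, 1))         # (random placeholders here)
converted = adaptive_instance_norm(content, gamma, beta)
```

The key property is that IN normalizes each channel over time within a single utterance, so utterance-level (largely speaker-dependent) statistics are removed while the frame-to-frame content structure is preserved; AdaIN then supplies a new set of statistics, which is how the target speaker's identity enters the output.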
