Non-Parallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks

We previously proposed a method that enables non-parallel voice conversion (VC) using a variant of generative adversarial networks (GANs) called StarGAN. The main features of this method, called StarGAN-VC, are as follows. First, it requires no parallel utterances, transcriptions, or time-alignment procedures for speech generator training. Second, it simultaneously learns mappings across multiple domains with a single generator network, so it can fully exploit training data collected from multiple domains to capture latent features common to all of them. Third, it can generate converted speech signals quickly enough for real-time implementation and requires only a few minutes of training examples to produce reasonably realistic-sounding speech. In this paper, we describe three formulations of StarGAN, including a newly introduced variant called the "Augmented Classifier StarGAN" (A-StarGAN), and compare them with one another and with several baseline methods in a non-parallel VC task.
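To make the training setup concrete, below is a minimal PyTorch sketch of the loss structure used by StarGAN-based VC: a single domain-conditioned generator, a conditional real/fake discriminator, and a domain classifier, trained with adversarial, domain-classification, and cycle-consistency terms. All architectures, dimensions (`N_DOMAINS`, `FEAT_DIM`), and loss weights here are illustrative assumptions, not the authors' actual configuration; in particular, the A-StarGAN formulation changes how the classifier treats generated samples, which this generic sketch does not show.

```python
# A minimal sketch (not the authors' implementation) of a StarGAN-style
# VC training objective. Architectures and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_DOMAINS = 4   # e.g., four speakers (assumption)
FEAT_DIM = 36   # e.g., mel-cepstral coefficients per frame (assumption)

class Generator(nn.Module):
    """Single generator G(x, c): converts features x into target domain c."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(FEAT_DIM + N_DOMAINS, 64, 5, padding=2),
            nn.GLU(dim=1),  # gated linear unit halves channels to 32
            nn.Conv1d(32, FEAT_DIM, 5, padding=2),
        )
    def forward(self, x, c):
        # Broadcast the one-hot domain code over time and concatenate.
        c_map = c.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return self.net(torch.cat([x, c_map], dim=1))

class Discriminator(nn.Module):
    """Real/fake critic, also conditioned on the domain code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(FEAT_DIM + N_DOMAINS, 64, 5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 5, padding=2),
        )
    def forward(self, x, c):
        c_map = c.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return self.net(torch.cat([x, c_map], dim=1)).mean(dim=(1, 2))

class Classifier(nn.Module):
    """Domain classifier C(x): predicts which domain x belongs to."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(FEAT_DIM, 64, 5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, N_DOMAINS, 5, padding=2),
        )
    def forward(self, x):
        return self.net(x).mean(dim=2)  # (batch, N_DOMAINS) logits

def generator_loss(G, D, C, x, c_src, c_tgt, lam_cls=1.0, lam_cyc=10.0):
    """Adversarial + domain-classification + cycle-consistency losses."""
    x_fake = G(x, c_tgt)
    adv = F.softplus(-D(x_fake, c_tgt)).mean()         # fool the critic
    cls = F.cross_entropy(C(x_fake), c_tgt.argmax(1))  # land in domain c_tgt
    cyc = F.l1_loss(G(x_fake, c_src), x)               # map back to source
    return adv + lam_cls * cls + lam_cyc * cyc

def discriminator_loss(G, D, C, x, c_src, c_tgt):
    """Critic update on real vs. generated features; classifier on real data."""
    with torch.no_grad():
        x_fake = G(x, c_tgt)
    d_loss = (F.softplus(-D(x, c_src)).mean()
              + F.softplus(D(x_fake, c_tgt)).mean())
    c_loss = F.cross_entropy(C(x), c_src.argmax(1))
    return d_loss + c_loss

if __name__ == "__main__":
    G, D, C = Generator(), Discriminator(), Classifier()
    x = torch.randn(8, FEAT_DIM, 128)  # batch of feature sequences
    c_src = F.one_hot(torch.randint(N_DOMAINS, (8,)), N_DOMAINS).float()
    c_tgt = F.one_hot(torch.randint(N_DOMAINS, (8,)), N_DOMAINS).float()
    print(generator_loss(G, D, C, x, c_src, c_tgt))
    print(discriminator_loss(G, D, C, x, c_src, c_tgt))
```

In practice the discriminator/classifier and the generator are updated in alternation on batches drawn from all domains, and the converted acoustic features are passed to a vocoder (such as WORLD) to synthesize the output waveform.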
