ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder

This paper proposes a non-parallel voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE. The proposed method has two key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture the time dependencies in the acoustic feature sequences of source and target speech. Second, it uses information-theoretic regularization for the model training to ensure that the information in the attribute class label will not be lost in the conversion process. With regular conditional VAEs, the encoder and decoder are free to ignore the attribute class label input. This can be problematic since in such a situation, the attribute class label will have little effect on controlling the voice characteristics of input speech at test time. Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier. We also present several ways to convert the feature sequence of input speech using the trained encoder and decoder and compare them in terms of audio quality through objective and subjective evaluations. We confirmed experimentally that the proposed method outperformed baseline non-parallel VC systems and performed comparably to an open-source parallel VC system trained using a parallel corpus in a speaker identity conversion task.
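The role of the auxiliary classifier can be illustrated with a minimal sketch of a per-example training loss. This is not the paper's exact criterion: the squared-error reconstruction term, the single weight `lam`, and all function names are assumptions made for illustration. The sketch only shows how a classification penalty on the decoder output can be added to the usual conditional VAE terms so that the decoder is discouraged from ignoring the attribute class label.

```python
import math

def gaussian_kl(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def squared_error(x, x_hat):
    """Reconstruction term (assumed here to be half the squared error)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def acvae_style_loss(x, x_hat, mu, log_var, clf_logits, label, lam=1.0):
    """Conditional VAE loss plus a classifier penalty on the decoder output.

    clf_logits: hypothetical auxiliary-classifier scores for the decoder
                output x_hat, one score per attribute class.
    label:      index of the attribute class the decoder was conditioned on.
    lam:        assumed weight of the classification term.
    """
    vae_term = squared_error(x, x_hat) + gaussian_kl(mu, log_var)
    return vae_term + lam * cross_entropy(clf_logits, label)
```

In this sketch, when the classifier cannot tell which class the decoder was conditioned on (uniform scores), the penalty is large; it shrinks as the decoder output becomes recognizable as the intended class.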
