Novel Adaptive Generative Adversarial Network for Voice Conversion

Voice Conversion (VC) transfers the speaking style of a source speaker to that of a target speaker while preserving the linguistic content of a given speech utterance. Recently, the Cycle-Consistent Adversarial Network (CycleGAN) and its variants have become popular for non-parallel VC tasks. However, CycleGAN requires two different generators and two discriminators. In this paper, we introduce a novel Adaptive Generative Adversarial Network (AdaGAN) for the non-parallel VC task, which requires only a single generator and two discriminators to transfer the style from one speaker to another while preserving the linguistic content in the converted voices. To the best of the authors' knowledge, this is the first study to introduce this Generative Adversarial Network (GAN)-based architecture (i.e., AdaGAN) in the machine learning literature, and the first attempt to apply it to the non-parallel VC task. We compare the results of AdaGAN with the state-of-the-art CycleGAN architecture. Detailed subjective and objective evaluations are carried out on the publicly available Voice Conversion Challenge 2018 corpus. In addition, we perform three statistical analyses that show the effectiveness of AdaGAN over CycleGAN for parallel-data-free one-to-one VC. For inter-gender and intra-gender VC, we observe that AdaGAN yields objective results comparable to CycleGAN and is superior in subjective evaluation: AdaGAN outperforms CycleGAN-VC in terms of naturalness, sound quality, and speaker similarity, and was preferred 58.33% and 41% more often than CycleGAN for speaker similarity and sound quality, respectively.
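The abstract does not spell out the "adaptive" mechanism, but a natural reading (cf. adaptive instance normalization, AdaIN, from image style transfer) is that speaker style is imposed by re-scaling normalized content features with the target speaker's feature statistics. The following is a minimal, hypothetical NumPy sketch of such a normalization step; the function name, feature shapes, and the use of per-channel statistics are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def adaptive_instance_norm(content, style, eps=1e-5):
    """AdaIN-style transfer on (channels, frames) feature matrices.

    Normalizes each channel of the content features to zero mean and
    unit variance, then re-scales with the style features' channel-wise
    mean and standard deviation, so the output carries the content
    structure but the style speaker's statistics.
    """
    c_mean = content.mean(axis=-1, keepdims=True)
    c_std = content.std(axis=-1, keepdims=True)
    s_mean = style.mean(axis=-1, keepdims=True)
    s_std = style.std(axis=-1, keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean
```

In such a design, a single generator suffices for both conversion directions because the direction is determined entirely by which speaker's statistics are supplied as `style`, which would explain the single-generator, two-discriminator layout described above.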
