One-Shot Voice Conversion with Speaker-Agnostic StarGAN

In this work, we propose a variant of StarGAN for many-to-many voice conversion (VC) conditioned on d-vectors, targeting short-duration (2–15 second) speech. We make several modifications to StarGAN training and employ new network architectures. We use a Transformer encoder in the discriminator network, and we apply the discriminator loss to the cycle-consistency and identity samples in addition to the generated (fake) samples. Instead of classifying samples as either real or fake, our discriminator predicts the categorical speaker class, with an additional fake class for generated samples. Furthermore, we apply a gradient reversal layer after the generator's encoder and use an auxiliary classifier to remove speaker information from the encoded representation. We show that our method outperforms the baseline in both objective and subjective evaluations of voice conversion quality. Moreover, we provide an ablation study showing each component's influence on speaker similarity.

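Two of the mechanisms described above lend themselves to a compact sketch: the gradient reversal layer that makes the encoder's output speaker-agnostic, and a discriminator objective that predicts a categorical speaker class with an extra fake class. The following is a minimal PyTorch sketch, not the paper's implementation; all names and sizes (NUM_SPEAKERS, CONTENT_DIM, AuxSpeakerClassifier, discriminator_loss) and the classifier architecture are illustrative assumptions.

```python
# Sketch of (1) a gradient reversal layer plus auxiliary speaker classifier
# and (2) a (K+1)-class discriminator loss with a dedicated "fake" class.
# Hyperparameters and module shapes are assumptions, not the paper's config.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) the gradient on
    the backward pass, as in Ganin & Lempitsky's domain adaptation."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)


NUM_SPEAKERS = 10   # assumed number of training speakers
CONTENT_DIM = 256   # assumed dimensionality of the encoder output


class AuxSpeakerClassifier(nn.Module):
    """Adversarial classifier placed after the generator's encoder. Because
    its gradient is reversed, minimizing its cross-entropy loss pushes the
    encoder toward a representation that carries no speaker information."""

    def __init__(self, dim=CONTENT_DIM, num_speakers=NUM_SPEAKERS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_speakers),
        )

    def forward(self, content, lambd=1.0):
        # Reverse gradients before classifying, so training the classifier
        # teaches the encoder to *remove* speaker cues, not preserve them.
        return self.net(grad_reverse(content, lambd))


def discriminator_loss(logits_real, speaker_ids, logits_fake):
    """Categorical discriminator objective: real samples are classified
    into one of NUM_SPEAKERS speaker classes; generated, cycle-consistency,
    and identity samples all map to the extra class index NUM_SPEAKERS.
    Logits are expected to have NUM_SPEAKERS + 1 columns."""
    fake_class = logits_real.new_full(
        (logits_fake.size(0),), NUM_SPEAKERS, dtype=torch.long)
    loss_real = F.cross_entropy(logits_real, speaker_ids)
    loss_fake = F.cross_entropy(logits_fake, fake_class)
    return loss_real + loss_fake
```

Routing the cycle-consistency and identity outputs through the same fake class (rather than only the directly generated samples) is what lets the discriminator penalize every synthesized signal path the generator produces, which is the behavior the abstract describes.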