StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition

Preserving the linguistic content of input speech is essential during voice conversion (VC). The star generative adversarial network-based VC method (StarGAN-VC) is a recently developed approach that enables non-parallel many-to-many VC. Although powerful, it can fail to preserve the linguistic content of input speech when the number of available training samples is extremely small. To overcome this problem, we propose using automatic speech recognition (ASR) to assist model training, improving StarGAN-VC especially in low-resource scenarios. Experimental results show that the proposed method retains more linguistic information than vanilla StarGAN-VC.
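To make the idea concrete, the following PyTorch sketch shows one plausible way an ASR term could regularize the StarGAN-VC generator objective. This is a hypothetical illustration, not the paper's implementation: generator, discriminator, classifier, and asr_model are assumed modules, the adversarial term uses a Wasserstein-style loss as one common choice, and the loss weights are placeholders. The asr_model is assumed to be pretrained and frozen (its parameters excluded from the optimizer), while gradients still flow through the converted features.

import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, classifier, asr_model,
                   src_feats, src_code, tgt_code, tgt_idx,
                   transcript_ids, transcript_lens,
                   lambda_cls=1.0, lambda_cyc=10.0, lambda_asr=1.0):
    """One generator update for ASR-regularized StarGAN-VC (sketch)."""
    # Convert source acoustic features to the target speaker.
    fake = generator(src_feats, tgt_code)

    # Adversarial term: make converted speech look real in the
    # target domain (Wasserstein-style, one common choice).
    adv = -discriminator(fake, tgt_code).mean()

    # Domain-classification term: converted speech should be
    # classified as the target speaker.
    cls = F.cross_entropy(classifier(fake), tgt_idx)

    # Cycle-consistency term: converting back to the source speaker
    # should recover the input features.
    cyc = F.l1_loss(generator(fake, src_code), src_feats)

    # ASR regularization (hypothetical formulation): a frozen ASR
    # network transcribes the converted features, and CTC loss against
    # the source transcript penalizes loss of linguistic content.
    log_probs = asr_model(fake).log_softmax(-1).transpose(0, 1)  # (T, B, V)
    input_lens = torch.full((fake.size(0),), log_probs.size(0),
                            dtype=torch.long)
    asr = F.ctc_loss(log_probs, transcript_ids, input_lens, transcript_lens)

    return adv + lambda_cls * cls + lambda_cyc * cyc + lambda_asr * asr

Only the lambda_asr term distinguishes this objective from vanilla StarGAN-VC; setting lambda_asr = 0 recovers the baseline generator loss.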
