Accent and Speaker Disentanglement in Many-to-many Voice Conversion

This paper proposes a joint voice and accent conversion approach that converts an arbitrary source speaker's voice to a target speaker's voice with a non-native accent. The problem is challenging because each target speaker has training data only in his or her native accent, so the conversion model must disentangle accent and speaker information during training and recombine them at conversion time. Within our recognition-synthesis conversion framework, we address this problem with two proposed techniques. First, we use accent-dependent speech recognizers to extract bottleneck (BN) features for speakers with different accents; this removes factors other than the linguistic information from the BN features used for conversion model training. Second, we apply adversarial training to better disentangle speaker and accent information in our encoder-decoder based conversion model. Specifically, we attach an auxiliary speaker classifier to the encoder and train it with an adversarial loss so that speaker information is removed from the encoder output. Experiments show that our approach outperforms the baseline: the proposed techniques are effective in improving accentedness, while audio quality and speaker similarity are well maintained.
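
To make the adversarial disentanglement concrete, below is a minimal sketch, in PyTorch, of an auxiliary speaker classifier attached to the encoder output through a gradient reversal layer. The class names, layer sizes, and the scaling factor lam are illustrative assumptions, not the authors' exact implementation; the point is only that reversing the gradient of the speaker-classification loss pushes the encoder to discard speaker identity while the classifier itself still learns to predict it.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class SpeakerAdversary(nn.Module):
    """Auxiliary speaker classifier on the encoder output, behind a gradient
    reversal layer, so the encoder is discouraged from encoding speaker cues."""

    def __init__(self, enc_dim: int, num_speakers: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.Linear(enc_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_speakers),
        )

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, frames, enc_dim) linguistic representation from the encoder
        reversed_feat = GradReverse.apply(enc_out, self.lam)
        return self.classifier(reversed_feat)  # per-frame speaker logits


if __name__ == "__main__":
    # Toy shapes for illustration only.
    batch, frames, enc_dim, num_speakers = 4, 100, 256, 10
    enc_out = torch.randn(batch, frames, enc_dim, requires_grad=True)
    speaker_ids = torch.randint(0, num_speakers, (batch,))

    adversary = SpeakerAdversary(enc_dim, num_speakers)
    logits = adversary(enc_out)  # (batch, frames, num_speakers)
    targets = speaker_ids.unsqueeze(1).expand(-1, frames)
    adv_loss = nn.CrossEntropyLoss()(
        logits.reshape(-1, num_speakers), targets.reshape(-1)
    )

    # In joint training this loss would be added to the conversion (reconstruction)
    # loss; the reversed gradient penalizes speaker information in the encoder output.
    adv_loss.backward()
    print(adv_loss.item())
```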
