A Modularized Neural Network with Language-Specific Output Layers for Cross-Lingual Voice Conversion

This paper presents a cross-lingual voice conversion framework based on a modularized neural network. The network has a common input module that is shared by both languages, and two language-specific output modules, one per language. The design is motivated by the observation that the phonetic systems of languages are similar, because humans share a common vocal production mechanism, whereas acoustic renderings, such as prosody and phonotactics, vary considerably across languages. The modularized neural network is trained to map phonetic posteriorgrams (PPGs) to acoustic features for multiple speakers, and is conditioned on a speaker i-vector to generate the desired target voice. We validate the idea between English and Mandarin in both objective and subjective tests. In addition, a mixed-lingual PPG derived from a unified English-Mandarin acoustic model is proposed to capture linguistic information from both languages. We find that the proposed modularized neural network significantly outperforms the baseline approaches in terms of speech quality and speaker individuality, and that the mixed-lingual PPG representation further improves conversion performance.
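To make the architecture concrete, below is a minimal PyTorch sketch of the modularized design described above: shared input layers map PPG frames, conditioned on a speaker i-vector, to a common hidden representation, and two language-specific output heads predict acoustic features for English and Mandarin. This is not the authors' implementation; the layer types (plain feed-forward layers standing in for whatever structure the paper uses) and all dimensions (ppg_dim, ivec_dim, hidden_dim, acoustic_dim) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModularizedVC(nn.Module):
    """Shared input module + language-specific output heads (sketch)."""

    def __init__(self, ppg_dim=131, ivec_dim=100, hidden_dim=512, acoustic_dim=187):
        super().__init__()
        # Common input module, shared by both languages.
        self.shared = nn.Sequential(
            nn.Linear(ppg_dim + ivec_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Separate output modules, one per language.
        self.head_en = nn.Linear(hidden_dim, acoustic_dim)
        self.head_zh = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, ppg, ivector, lang):
        # ppg: (batch, frames, ppg_dim); ivector: (batch, ivec_dim)
        # Broadcast the speaker i-vector to every frame and concatenate.
        ivec = ivector.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h = self.shared(torch.cat([ppg, ivec], dim=-1))
        # Route through the output head of the utterance's language.
        return self.head_en(h) if lang == "en" else self.head_zh(h)
```

Under this reading, each training minibatch updates the shared module and only its own language's head, so the shared layers learn a language-independent PPG-to-acoustic mapping while the heads absorb language-specific rendering; at conversion time, the target speaker's i-vector selects the output voice.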
