Voice conversion with SI-DNN and KL divergence based mapping without parallel training data

Abstract: We propose a Speaker-Independent Deep Neural Network (SI-DNN) and Kullback-Leibler Divergence (KLD) based mapping approach to voice conversion that requires no parallel training data. The acoustic difference between the source and target speakers is equalized by the SI-DNN via its estimated output posteriors, which serve as a probabilistic mapping from acoustic input frames to the corresponding symbols in the phonetic space. KLD is chosen as the distortion measure for finding an appropriate mapping from each frame of the source speaker to a frame of the target speaker. The mapped acoustic segments of the target speaker form the construction bases for voice conversion. Depending on whether word transcriptions of the target speaker's training data are available, the approach can run in either a supervised or an unsupervised mode. In the supervised mode, where adequate training data are available to train a conventional statistical parametric TTS system for the target speaker, each input frame of the source speaker is converted to its nearest sub-phonemic "senone". In the unsupervised mode, each frame is converted to the nearest clustered phonetic centroid or raw speech frame, in the minimum-KLD sense. The acoustic trajectory of the converted voice is rendered with the maximum-probability trajectory generation algorithm. Both objective and subjective evaluations of voice conversion performance show that the new algorithm outperforms a sequence error minimization based DNN baseline trained with parallel data.
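The core mapping step described above — matching each source frame to a target frame by minimum KL divergence between their phonetic posterior vectors — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the posterior vectors are assumed to come from an SI-DNN's softmax output, and the function names (`symmetric_kld`, `map_frames`) are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete posterior distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def symmetric_kld(p, q):
    """Symmetrized KLD, a common choice of distortion measure between frames."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def map_frames(source_posteriors, target_posteriors):
    """For each source-frame posterior, return the index of the target
    frame (senone, centroid, or raw frame) with minimum symmetric KLD."""
    mapping = []
    for p in source_posteriors:
        distortions = [symmetric_kld(p, q) for q in target_posteriors]
        mapping.append(int(np.argmin(distortions)))
    return mapping

# Toy example: one source frame, two candidate target frames.
src = np.array([[0.90, 0.05, 0.05]])
tgt = np.array([[0.10, 0.80, 0.10],   # phonetically distant
                [0.85, 0.10, 0.05]])  # phonetically close
print(map_frames(src, tgt))  # the source frame maps to target index 1
```

In practice the selected target frames (or the parameters of the selected senones) would then be concatenated and smoothed by the trajectory generation algorithm to render the converted speech.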
