Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine

In this paper, we present a voice conversion (VC) method that does not require any parallel data during model training. VC is a technique in which only the speaker-specific information in source speech is converted while the phonological information is kept unchanged. Most existing VC methods rely on parallel data: pairs of speech from the source and target speakers uttering the same sentences. However, using parallel data for training causes several problems: 1) the training data are limited to predefined sentences, 2) the trained model can be applied only to the speaker pair used in training, and 3) mismatches may occur in alignment. Although it is therefore clearly preferable for VC not to use parallel data, nonparallel training is considered difficult. In our approach, we achieve nonparallel training by combining a speaker adaptation technique with the capture of latent phonological information. The approach assumes that speech signals are generated from a restricted Boltzmann machine-based probabilistic model in which phonological information and speaker-related information are defined explicitly. Speaker-independent and speaker-dependent parameters are trained simultaneously under speaker adaptive training. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and the voice-converted speech is obtained by recombining the two. Our experimental results show that our approach outperforms another nonparallel approach and produces results comparable, in both subjective and objective evaluations, to those of the popular conventional Gaussian mixture model-based method trained on parallel data.
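The decompose-swap-recombine procedure described above can be sketched in code. The following is a minimal, hypothetical simplification (not the paper's actual model): a Gaussian-Bernoulli RBM whose hidden units stand in for phonological information, with per-speaker visible bias vectors standing in for speaker-related information; all class and function names are illustrative assumptions. Conversion infers the hidden representation from a source frame and reconstructs it under the target speaker's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AdaptiveRBM:
    """Toy Gaussian-Bernoulli RBM with speaker-dependent visible biases.

    W and c are speaker-independent (shared); b[s] is the visible bias
    for speaker s. Under this decomposition the hidden posterior does not
    depend on the speaker bias, so the hidden units act as a (crude)
    speaker-independent, phonology-like representation."""

    def __init__(self, n_vis, n_hid, n_speakers):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.c = np.zeros(n_hid)                # shared hidden bias
        self.b = np.zeros((n_speakers, n_vis))  # per-speaker visible bias

    def hidden_probs(self, v):
        # Posterior over hidden (phonological) units given a speech frame v
        return sigmoid(v @ self.W + self.c)

    def reconstruct(self, h, spk):
        # Mean of the Gaussian visible units under speaker spk's parameters
        return h @ self.W.T + self.b[spk]

    def cd1_update(self, v, spk, lr=0.01):
        # One contrastive-divergence (CD-1) step; shared and speaker-dependent
        # parameters are updated jointly, as in speaker adaptive training.
        h0 = self.hidden_probs(v)
        v1 = self.reconstruct(h0, spk)
        h1 = self.hidden_probs(v1)
        self.W += lr * (v[:, None] * h0[None, :] - v1[:, None] * h1[None, :])
        self.c += lr * (h0 - h1)
        self.b[spk] += lr * (v - v1)

    def convert(self, v, src, tgt):
        # Decompose: extract phonological info from the source frame;
        # recombine: re-synthesize with the target speaker's parameters.
        h = self.hidden_probs(v)
        return self.reconstruct(h, tgt)
```

A usage sketch: train on (unpaired) frames from each speaker via `cd1_update(frame, spk)`, then call `convert(frame, src, tgt)` to obtain a converted spectral frame. A practical system would operate on spectral envelope features and add variance parameters, weight decay, and minibatching, all omitted here for brevity.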
