Voice conversion using speaker-dependent conditional restricted Boltzmann machine

This paper presents a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain high-order speaker-individual spaces in which voice features are converted more easily than in the original acoustic feature space. The CRBM is expected to automatically discover common features lurking in time-series data. When two CRBMs are trained independently, one for the source speaker and one for the target speaker, using only speaker-dependent training data, each CRBM can be regarded as constructing a subspace that contains fewer phonological variations and relatively more speaker individuality than the original acoustic space, because the training data cover various phonemes while the speaker individuality remains unchanged. The obtained high-order features of the source speaker are then connected to those of the target speaker by a neural network (NN). The entire network (the two CRBMs and the connecting NN) can also be fine-tuned as a recurrent neural network (RNN) using parallel acoustic data, since both the CRBMs and the connecting NN are network-based representations with time dependencies. Voice-conversion experiments confirmed the high performance of our method, especially in terms of objective evaluation, in comparison with conventional GMM-, NN-, and RNN-based approaches and with our previous work on speaker-dependent DBNs.
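To make the architecture concrete, the sketch below illustrates the kind of model the abstract describes: a Gaussian-Bernoulli conditional RBM whose visible frame is conditioned on a few preceding frames, trained with one-step contrastive divergence (CD-1), plus a conversion routine that maps source hidden features to target hidden features and decodes autoregressively. This is a minimal sketch, not the paper's implementation: the unit-variance Gaussian visible units, the CD-1 training rule, the conditioning order, and the `mapping_nn` callable are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CRBM:
    """Gaussian-Bernoulli conditional RBM (in the style of Taylor & Hinton).

    The visible frame v_t is conditioned on the previous `order` frames via
    autoregressive weights A (past -> visible) and B (past -> hidden).
    Unit-variance Gaussian visible units are assumed for simplicity.
    """
    def __init__(self, n_vis, n_hid, order=2, lr=1e-3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_vis, n_hid))
        self.A = 0.01 * self.rng.standard_normal((order * n_vis, n_vis))
        self.B = 0.01 * self.rng.standard_normal((order * n_vis, n_hid))
        self.b = np.zeros(n_vis)          # visible bias
        self.c = np.zeros(n_hid)          # hidden bias
        self.order, self.lr = order, lr

    def hidden_probs(self, v, past):
        """p(h = 1 | v, past); these serve as the high-order features."""
        return sigmoid(v @ self.W + past @ self.B + self.c)

    def visible_mean(self, h, past):
        """E[v | h, past] for Gaussian visible units."""
        return h @ self.W.T + past @ self.A + self.b

    def cd1_update(self, v, past):
        """One-step contrastive divergence (CD-1) on a minibatch of frames."""
        h0 = self.hidden_probs(v, past)
        hs = (self.rng.random(h0.shape) < h0).astype(v.dtype)
        v1 = self.visible_mean(hs, past)  # mean-field reconstruction
        h1 = self.hidden_probs(v1, past)
        n = v.shape[0]
        self.W += self.lr * (v.T @ h0 - v1.T @ h1) / n
        self.A += self.lr * past.T @ (v - v1) / n
        self.B += self.lr * past.T @ (h0 - h1) / n
        self.b += self.lr * (v - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)

def make_past(frames, order):
    """Stack the previous `order` frames as the conditioning input."""
    T, D = frames.shape
    past = np.zeros((T, order * D))
    for k in range(1, order + 1):
        past[k:, (k - 1) * D:k * D] = frames[:-k]
    return past

def convert(src_crbm, tgt_crbm, mapping_nn, src_frames):
    """Hypothetical conversion: source frames -> source hidden features ->
    mapped target hidden features -> autoregressively decoded target frames.
    `mapping_nn` stands in for the connecting NN trained on parallel data."""
    h_src = src_crbm.hidden_probs(src_frames,
                                  make_past(src_frames, src_crbm.order))
    h_tgt = mapping_nn(h_src)
    T, D = src_frames.shape[0], tgt_crbm.b.shape[0]
    out = np.zeros((T, D))
    for t in range(T):
        past_t = np.zeros((1, tgt_crbm.order * D))
        for k in range(1, tgt_crbm.order + 1):
            if t - k >= 0:
                past_t[0, (k - 1) * D:k * D] = out[t - k]
        out[t] = tgt_crbm.visible_mean(h_tgt[t:t + 1], past_t)[0]
    return out
```

In the full method, each CRBM would be pre-trained on its own speaker's data, the connecting NN trained on parallel utterances, and the whole source-CRBM → NN → target-CRBM stack fine-tuned jointly as an RNN; the sketch above covers only the building blocks before that joint fine-tuning stage.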
