Voice Conversion Using RNN Pre-Trained by Recurrent Temporal Restricted Boltzmann Machines

This paper presents a voice conversion (VC) method that uses recurrent temporal restricted Boltzmann machines (RTRBMs), a recently proposed class of probabilistic models. One RTRBM is used for each speaker, with the goal of capturing high-order temporal dependencies in an acoustic sequence. Our algorithm first trains two RTRBMs separately, one for the source speaker and one for the target speaker, using speaker-dependent training data. Because each RTRBM attempts to discover abstractions that maximally express the training data at each time step, as well as the temporal dependencies in that data, we expect the models to represent linguistically related latent features in a high-order space. In our approach, we then convert (match) the latent features of the source speaker to those of the target speaker using a neural network (NN), so that the entire network (the two RTRBMs and the connecting NN) forms a deep recurrent NN that can be fine-tuned as a whole. VC experiments confirm that our method outperforms conventional VC methods, such as approaches based on Gaussian mixture models (GMMs) and on NNs, especially in terms of objective criteria. A sketch of the conversion pipeline follows.
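The pipeline described above can be summarized in a few lines: the source RTRBM encodes an acoustic feature sequence into latent activations, a small NN maps source latents to target latents, and the target RTRBM decodes back to acoustic features. Below is a minimal mean-field sketch of that pipeline in NumPy, assuming Gaussian visible units; the temporal visible bias of the full RTRBM, contrastive-divergence pre-training, and backpropagation fine-tuning are all omitted, and every name here (`RTRBM.encode`, `convert`, `W_nn`) is illustrative rather than the paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RTRBM:
    """Minimal recurrent temporal RBM (Sutskever et al., 2008), used here
    only for deterministic mean-field encoding/decoding; training is omitted,
    as is the temporal visible bias of the full model."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_hidden, n_visible))  # visible-hidden weights
        self.U = 0.01 * rng.standard_normal((n_hidden, n_hidden))   # hidden-to-hidden (temporal) weights
        self.b_v = np.zeros(n_visible)                              # visible bias
        self.b_h = np.zeros(n_hidden)                               # hidden bias

    def encode(self, V):
        """Mean-field hidden activations for a feature sequence V of shape (T, n_visible)."""
        H = np.zeros((len(V), self.b_h.size))
        h_prev = np.zeros(self.b_h.size)
        for t, v in enumerate(V):
            # The effective hidden bias at time t is modulated by the previous hidden state.
            h_prev = sigmoid(self.W @ v + self.U @ h_prev + self.b_h)
            H[t] = h_prev
        return H

    def decode(self, H):
        """Mean reconstruction of Gaussian visible units from hidden activations."""
        return H @ self.W + self.b_v

def convert(V_src, rtrbm_src, rtrbm_tgt, W_nn, b_nn):
    """Source RTRBM encoder -> one-layer latent-to-latent NN -> target RTRBM decoder."""
    H_src = rtrbm_src.encode(V_src)
    H_tgt = sigmoid(H_src @ W_nn.T + b_nn)  # NN mapping in the latent space
    return rtrbm_tgt.decode(H_tgt)

# Toy usage: convert a random 24-dimensional spectral sequence of 100 frames.
src, tgt = RTRBM(24, 64, seed=1), RTRBM(24, 64, seed=2)
W_nn = 0.01 * np.random.default_rng(3).standard_normal((64, 64))
converted = convert(np.random.default_rng(4).standard_normal((100, 24)),
                    src, tgt, W_nn, np.zeros(64))
print(converted.shape)  # (100, 24)
```

Since every stage of `convert` is differentiable, fine-tuning the whole stack amounts to backpropagating a loss on the target features through the decoder, the NN, and the encoder, which is what lets the combined model be treated as a single deep recurrent NN.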
