Segment Level Voice Conversion with Recurrent Neural Networks

Voice conversion techniques aim to modify a subject's voice characteristics to mimic those of another person. Because source and target utterances differ in length, state-of-the-art voice conversion systems often rely on a frame-alignment pre-processing step. This step aligns entire utterances with algorithms such as dynamic time warping (DTW), which introduce errors that hinder system performance. In this paper we present a new technique that avoids frame-level alignment of entire utterances while preserving local context during training. For this purpose, we combine an RNN model with phoneme- or syllable-level information obtained from a speech recognition system. The recognizer segments the utterances into units, which can then be grouped into overlapping windows, providing the context the model needs to learn temporal dependencies. We show that this approach attains notable improvements over a state-of-the-art RNN voice conversion system on the CMU ARCTIC database. It is also worth noting that with this technique it is possible to halve the training data size and still outperform the baseline.
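The windowing idea described above can be illustrated with a minimal sketch. The function below is a hypothetical helper (not from the paper): it takes a list of recognizer-produced segments and groups consecutive segments into overlapping windows, each of which would serve as one training sequence supplying local context to the RNN. Window size and hop are illustrative assumptions.

```python
def overlapping_windows(segments, window_size=3, hop=1):
    """Group consecutive segments into overlapping windows.

    Each window is a short run of adjacent segments, giving the
    model local temporal context without aligning whole utterances.
    """
    return [segments[i:i + window_size]
            for i in range(0, len(segments) - window_size + 1, hop)]

# Hypothetical phoneme segments for one utterance, e.g. from an ASR system
segments = ["sil", "hh", "ax", "l", "ow", "sil"]
windows = overlapping_windows(segments)
# With hop=1 the windows overlap, so every segment appears in
# several training sequences with slightly different context.
```

With `window_size=3` and `hop=1`, six segments yield four windows, and adjacent windows share two segments each, which is what provides the overlapping context.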
