A Study of Bilinear Models in Voice Conversion

This paper presents a voice conversion technique based on bilinear models and introduces the concept of contextual modeling. The bilinear approach reformulates the spectral envelope representation, moving from line spectral frequency (LSF) features to a two-factor parameterization whose factors correspond to speaker identity and phonetic information, the so-called style and content factors. This decomposition offers a flexible representation suitable for voice conversion and facilitates the use of efficient training algorithms based on singular value decomposition. In the contextual approach, (bilinear) models are trained on subsets of the training data that are selected on the fly at conversion time, depending on the characteristics of the feature vector to be converted. The performance of bilinear models and contextual modeling is evaluated in objective and perceptual tests against the popular GMM-based voice conversion method, for several training set sizes and different types of training data.
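As a rough illustration of the style and content factorization, the sketch below fits an asymmetric bilinear model with a truncated singular value decomposition and then maps a source-speaker frame to the target speaker. It is a minimal sketch under stated assumptions, not the procedure evaluated in the paper: the data are synthetic and exactly low-rank, the frames of the two speakers are assumed to be time-aligned, and the names `fit_asymmetric_bilinear`, `convert_frame`, `rank`, and all dimensions are hypothetical choices made for the example.

```python
import numpy as np


def fit_asymmetric_bilinear(Y, n_styles, dim, rank):
    """Fit Y ~= A @ B by truncated SVD.

    Y has shape (n_styles * dim, n_contents): the feature vectors of each
    style (speaker) are stacked vertically, one column per aligned content.
    Returns style matrices A with shape (n_styles, dim, rank) and content
    vectors B with shape (rank, n_contents).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # stacked style-specific linear maps
    B = Vt[:rank, :]             # style-independent content vectors
    return A.reshape(n_styles, dim, rank), B


def convert_frame(A, y_src, src=0, tgt=1):
    """Estimate the content of a source frame, then render it in the target style."""
    b = np.linalg.pinv(A[src]) @ y_src   # least-squares content estimate
    return A[tgt] @ b


# Synthetic, exactly bilinear data (an assumption made for this illustration).
rng = np.random.default_rng(0)
n_styles, dim, n_contents, rank = 2, 10, 40, 4
A_true = rng.normal(size=(n_styles, dim, rank))
B_true = rng.normal(size=(rank, n_contents))
Y = np.concatenate([A_true[s] @ B_true for s in range(n_styles)], axis=0)

A, B = fit_asymmetric_bilinear(Y, n_styles, dim, rank)
Y_hat = np.concatenate([A[s] @ B for s in range(n_styles)], axis=0)
recon_error = np.max(np.abs(Y - Y_hat))

# Convert one source-speaker frame and compare with the aligned target frame.
j = 7
y_src = Y[:dim, j]
y_tgt = Y[dim:, j]
y_conv = convert_frame(A, y_src)
conv_error = np.max(np.abs(y_conv - y_tgt))
```

On real LSF data the factorization would only be approximate, so both errors would be nonzero and `rank` would become a tuning parameter. A contextual variant in the spirit of the abstract would refit the model at conversion time on a subset of training frames chosen according to the input frame; the selection rule is not specified here and is left out of the sketch.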
