Part-Syllable Transformation-Based Voice Conversion with Very Limited Training Data

Voice conversion suffers from two drawbacks: it requires a large number of sentences from the target speaker, and (in concatenative methods) it introduces concatenation error. This research introduces part-syllable transformation-based voice conversion (PST-VC), a method that performs voice conversion with very limited data from a target speaker while simultaneously reducing concatenation error. In this method, every syllable is segmented into three parts: a left transition, a vowel core, and a right transition. Using this new language unit, called the part-syllable (PS), PST-VC reduces concatenation error by moving segmentation and concatenation from the transition points to the relatively stable points of a syllable. Since the greatest amount of speaker-specific information is contained in the vowels, the PST-VC method uses this information to transform the vowels into all of the language's PSs. In this approach, a series of transformations is trained that can generate all of the PSs of a target speaker's voice from a single vowel core given as input. With all of the PSs available, any utterance of the target speaker can be imitated. PST-VC therefore reduces the required training data to a single-syllable word while also reducing concatenation error.
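To make the pipeline concrete, the sketch below illustrates the two ideas the abstract describes: segmenting a syllable's feature frames into the three PS units (left transition, vowel core, right transition), and training a transformation that maps a vowel-core feature vector to a target PS. The function names, the frame-based representation, and the use of a simple least-squares linear map are all illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def split_syllable(frames, core_start, core_end):
    """Segment a syllable's feature frames (one row per frame) into the
    three part-syllable (PS) units used by PST-VC:
    left transition, vowel core, right transition.
    core_start/core_end are assumed frame indices of the vowel core."""
    left = frames[:core_start]
    core = frames[core_start:core_end]
    right = frames[core_end:]
    return left, core, right

def train_ps_transform(core_feats, ps_feats):
    """Fit one transformation from vowel-core features to one target PS.
    A linear least-squares map is an illustrative stand-in for the
    paper's trained transformations; one such map would be trained
    per PS of the language."""
    W, *_ = np.linalg.lstsq(core_feats, ps_feats, rcond=None)
    return W

def generate_ps(W, core_vec):
    """Generate a target speaker's PS features from a single vowel core."""
    return core_vec @ W
```

In this picture, conversion with very limited target data amounts to: extract the vowel core from one recorded syllable of the target speaker, apply every trained transformation to it to synthesize the full PS inventory, then concatenate PSs at the stable vowel-core boundaries rather than at transition points.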
