Part-Syllable Transformation-Based Voice Conversion with Very Limited Training Data

Voice conversion suffers from two drawbacks: it requires a large number of sentences from the target speaker, and (in concatenative methods) it introduces concatenation error. This research introduces part-syllable transformation-based voice conversion (PST-VC), a method that performs voice conversion with very limited data from a target speaker while simultaneously reducing concatenation error. In this method, every syllable is segmented into three parts: a left transition, a vowel core, and a right transition. Using this new language unit, called the part-syllable (PS), PST-VC reduces concatenation error by moving segmentation and concatenation from the transition points to the relatively stable points of a syllable. Since the greatest amount of speaker-specific information is contained in the vowels, the PST-VC method uses this information to transform the vowels into all of the language's PSs. In this approach, a series of transformations is trained that can generate all of the PSs of a target speaker's voice from a single vowel core given as input. With all of the PSs available, any utterance of the target speaker can be imitated. PST-VC therefore reduces the required training data to a single-syllable word while also reducing concatenation error.
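To make the pipeline concrete, the sketch below illustrates the two ideas the abstract describes: segmenting a syllable's feature frames into the three PS units (left transition, vowel core, right transition), and training a transformation that maps a vowel-core feature vector to a target PS. The function names, the frame-based representation, and the use of a simple least-squares linear map are all illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def split_syllable(frames, core_start, core_end):
    """Segment a syllable's feature frames (one row per frame) into the
    three part-syllable (PS) units used by PST-VC:
    left transition, vowel core, right transition.
    core_start/core_end are assumed frame indices of the vowel core."""
    left = frames[:core_start]
    core = frames[core_start:core_end]
    right = frames[core_end:]
    return left, core, right

def train_ps_transform(core_feats, ps_feats):
    """Fit one transformation from vowel-core features to one target PS.
    A linear least-squares map is an illustrative stand-in for the
    paper's trained transformations; one such map would be trained
    per PS of the language."""
    W, *_ = np.linalg.lstsq(core_feats, ps_feats, rcond=None)
    return W

def generate_ps(W, core_vec):
    """Generate a target speaker's PS features from a single vowel core."""
    return core_vec @ W
```

In this picture, conversion with very limited target data amounts to: extract the vowel core from one recorded syllable of the target speaker, apply every trained transformation to it to synthesize the full PS inventory, then concatenate PSs at the stable vowel-core boundaries rather than at transition points.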
