Sub‐syllable segment‐based voice conversion using spectral block clustering transformation functions

Abstract This paper presents a novel framework for voice conversion based on sub-syllable spectral block clustering transformation functions. The speech signal is first transformed into a spectral representation by the Fast Fourier Transform. A sonority measure, computed from the energy concentration among frequency components, is then used to extract sub-syllable segments from the input utterances. Based on the syllable structure of Mandarin, Hidden Markov Model-based syllable clustering is applied to handle the variation among different syllables. Dynamic programming aligns the spectral blocks of the parallel corpus, constraining the mapping so that each spectral unit of the source speaker is paired with a unit of the target speaker belonging to the same sub-syllable and the same sub-band of the Mel-scale filter bank. In the transformation phase, a content-based image retrieval algorithm is employed to find the target spectral block. In this way, voice conversion is performed by spectral block transformation, converting the speech signal of the source speaker into that of the target speaker. Experimental results show that the proposed method is effective for voice conversion and achieves better discrimination in speaker identification than traditional approaches. However, residual noise remains, especially in the high-frequency components, because the converted spectra are not smooth across blocks, which degrades signal quality in the transformation phase.
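The paper does not spell out its alignment procedure beyond "dynamic programming on spectral blocks of a parallel corpus," so the following is only an illustrative sketch of that step using classic dynamic time warping (DTW) over sequences of spectral vectors; the function name `dtw_align` and the Euclidean frame distance are assumptions, not the authors' implementation. In the paper's setting the two sequences would additionally be restricted to frames from the same sub-syllable and Mel-scale sub-band before alignment.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two sequences of spectral vectors with dynamic programming
    (classic DTW). Returns the accumulated cost and the warping path as
    a list of (source_index, target_index) pairs."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # deletion
                                 cost[i, j - 1],       # insertion
                                 cost[i - 1, j - 1])   # match
    # Backtrack from (n, m) to recover the optimal warping path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: cost[s])
    path.reverse()
    return cost[n, m], path

# Toy example: the target "holds" the middle frame one step longer.
src = np.array([[0.0], [1.0], [2.0]])
tgt = np.array([[0.0], [1.0], [1.0], [2.0]])
total_cost, path = dtw_align(src, tgt)
# total_cost == 0.0; path == [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The pairs returned by the backtrack are what a conversion system would use to tie each source spectral block to its target counterpart for training the transformation functions.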
