Observation of empirical cumulative distribution of vowel spectral distances and its application to vowel based voice conversion

A simple and fast voice conversion method based only on vowel information is proposed. The proposed method relies on empirical distribution of perceptual spectral distances between representative examples of each vowel segment extracted using TANDEM-STRAIGHT spectral envelope estimation procedure [1]. Mapping functions of vowel spectra are designed to preserve vowel space structure defined by the observed empirical distribution while transforming position and orientation of the structure in an abstract vowel spectral space. By introducing physiological constraints in vocal tract shapes and vocal tract length normalization, difficulties in careful frequency alignment between vowel template spectra of the source and the target speakers can be alleviated without significant degradations in converted speech. The proposed method is a framebased instantaneous method and is relevant for real-time processing. Applications of the proposed method in-cross language voice conversion are also discussed. Index Terms: voice conversion, STRAIGHT, vowel structure, line spectral pair, vocal tract length

[1]  Satoshi Nakamura,et al.  Robust fundamental frequency estimation using instantaneous frequencies of harmonic components , 2000, INTERSPEECH.

[2]  Nobuaki Minematsu Mathematical evidence of the acoustic universal structure in speech , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[3]  Hideki Kawahara,et al.  Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[5]  Richard E. Turner,et al.  A statistical, formant-pattern model for segregating vowel type and vocal-tract length in developmental formant data. , 2009, The Journal of the Acoustical Society of America.

[6]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[7]  M. Unser Sampling-50 years after Shannon , 2000, Proceedings of the IEEE.

[8]  Keikichi Hirose,et al.  Para-Linguistic Information Represented as Distortion of the Acoustic Universal Structure In Speech , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[9]  Richard E. Turner,et al.  The processing and perception of size information in speech sounds. , 2005, The Journal of the Acoustical Society of America.