Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT

A new method for source information extraction is proposed. Its aim is to provide optimal source information for the very high quality speech manipulation system STRAIGHT. The method combines time interval and frequency cues, and it provides fundamental frequency (F0) and periodicity information within each frequency band, which allows mixed mode excitation. A preliminary evaluation using a database of simultaneously recorded EGG (electroglottograph) and speech signals yielded very low gross error rates (0.029% for females and 0.14% for males). The method is also designed to minimize the perceptual disturbance caused by any gross errors that remain.
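To make the reported figure concrete, the following is a minimal sketch of how a gross error rate could be computed from a frame-wise F0 estimate and an EGG-derived reference. The function name `gross_error_rate`, the 20% relative tolerance, and the convention that unvoiced frames are marked with 0 are assumptions made for illustration; the abstract does not state the criterion actually used in the evaluation.

```python
import numpy as np


def gross_error_rate(f0_ref, f0_est, tolerance=0.2):
    """Percentage of voiced reference frames whose estimate is grossly wrong.

    Assumptions (not taken from the paper): a frame counts as a gross error
    when the estimate deviates from the reference by more than `tolerance`
    (relative), and unvoiced reference frames are marked with 0 and skipped.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    if f0_ref.shape != f0_est.shape:
        raise ValueError("reference and estimate must have the same length")

    voiced = f0_ref > 0  # evaluate only frames the reference marks as voiced
    if not voiced.any():
        return 0.0

    rel_dev = np.abs(f0_est[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return 100.0 * float(np.mean(rel_dev > tolerance))


if __name__ == "__main__":
    # Hypothetical frame-wise F0 values in Hz; 0 marks an unvoiced frame.
    ref = [100.0, 100.0, 0.0, 200.0, 200.0]
    est = [101.0, 200.0, 50.0, 199.0, 90.0]
    print(f"gross error rate: {gross_error_rate(ref, est):.1f}%")
```

In this toy input, two of the four voiced frames (an octave jump up and a near-octave drop) exceed the tolerance, while the small deviations count as fine errors and are ignored.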
