Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT

A new control paradigm of source signals for high quality speech synthesis is introduced to handle a variety of speech quality, based on timefrequency analyses by the use of an instantaneous frequency and group delay. The proposed signal representation consists of a frequency domain aperiodicity measure and a time domain energy concentration measure to represent source attributes, which supplement the conventional source information, such as F0 and power. The frequency domain aperiodicity measure is defined as a ratio between the lower and upper smoothed spectral envelopes to represent the relative energy distribution of aperiodic components. The time domain measure is defined as an effective duration of the aperiodic component. These aperiodicity parameters and F0 as time functions are used to generate the source signal for synthetic speech by controlling relative noise levels and the temporal envelope of the noise component of the mixed mode excitation signal, including fine timing and amplitude fluctuations. A series of preliminary simulation experiments was conducted to test and to demonstrate consistency of the proposed method. Examples sung in different voice qualities were also analyzed and resynthesized using the proposed method.

[1]  O. Fujimura An approximation to voice aperiodicity , 1968 .

[2]  F. Cesbron Pitch detection using the short-term phase spectrum , 1992 .

[3]  Martin Cooke,et al.  Modelling auditory processing and organisation , 1993, Distinguished dissertations in computer science.

[4]  Takao Kobayashi,et al.  Harmonics Estimation Based on Instantaneous Frequency and Its Application to Pitch Determination of Speech , 1995, IEICE Trans. Inf. Syst..

[5]  Hideki Kawahara,et al.  Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Roy D. Patterson,et al.  Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity , 1999, EUROSPEECH.

[7]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[8]  Satoshi Nakamura,et al.  Robust fundamental frequency estimation using instantaneous frequencies of harmonic components , 2000, INTERSPEECH.

[9]  Hideki Kawahara,et al.  Accurate vocal event detection method based on a fixed-point analysis of mapping from time to weighted average group delay , 2000, INTERSPEECH.

[10]  Jan Skoglund,et al.  On time-frequency masking in voiced speech , 2000, IEEE Trans. Speech Audio Process..

[11]  A. Bregman Auditory Scene Analysis , 2001 .