Evaluation and optimization of F0-adaptive spectral envelope estimation based on spectral smoothing with peak emphasis

A new spectral estimation method is proposed that improves the processed sound quality of STRAIGHT, a speech analysis, modification, and re-synthesis framework widely used for high-quality speech and singing manipulation. Applied to TANDEM-STRAIGHT, a completely reformulated version of STRAIGHT, the proposed method yielded a better spectral envelope approximation than conventional methods such as LPC, cepstrum, and legacy-STRAIGHT. TANDEM-STRAIGHT consists of two parts: a temporally stable power spectrum estimation method for periodic signals (TANDEM) and a spectral envelope calculation method based on consistent sampling theory. The proposed method applies F0-adaptive smoothing and compensation to the logarithmic power spectrum to improve the approximation accuracy of spectral peaks, which affects the quality of re-synthesized sound. A series of simulations was conducted to optimize the internal parameters of the proposed method. The optimized system was evaluated and compared with conventional methods using stylized spectra and simulated speech spectra. The evaluation was based on the spectral distance measure proposed by Itakura and Saito, modified to use the perceptually relevant ERB_N-number frequency axis. For the simulated speech spectra, power spectra calculated from vocal tract area functions measured using MRI, combined with LF-model excitation spectra, were used as the ground truth, and the spectral distances between this target and the estimated spectra were evaluated. Periodic pulse trains were used as the excitation signal in this case. The evaluation results indicated that the proposed method yields a smaller spectral distance than conventional methods such as LPC, cepstrum, and legacy-STRAIGHT.
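As an illustration of the kind of distance measure described above, the following is a minimal sketch, not the authors' implementation. It assumes the standard Itakura-Saito form, mean(r - ln r - 1) with r the ratio of reference to estimated power, and the Glasberg and Moore ERB_N-number formula, 21.4 log10(4.37 f/1000 + 1). The abstract does not specify how the frequency-axis modification is carried out, so resampling the power ratio onto a uniform ERB_N-number grid is an assumption made here. The function names and the `n_points` parameter are hypothetical.

```python
import numpy as np


def hz_to_erb_number(f_hz):
    """Map frequency in Hz to ERB_N-number (Glasberg and Moore formula)."""
    f_khz = np.asarray(f_hz, dtype=float) / 1000.0
    return 21.4 * np.log10(4.37 * f_khz + 1.0)


def itakura_saito_erb(p_true, p_est, freqs_hz, n_points=256):
    """Itakura-Saito distance averaged over a uniform ERB_N-number grid.

    p_true, p_est: positive power spectra sampled at freqs_hz (increasing).
    The uniform-grid resampling is an assumption of this sketch.
    """
    p_true = np.asarray(p_true, dtype=float)
    p_est = np.asarray(p_est, dtype=float)
    erb = hz_to_erb_number(freqs_hz)
    # Uniform grid on the ERB_N-number axis spanning the analysed band.
    grid = np.linspace(erb[0], erb[-1], n_points)
    # Interpolate the power ratio onto the warped axis.
    ratio = np.interp(grid, erb, p_true / p_est)
    # r - ln(r) - 1 is zero when r == 1 and positive otherwise.
    return float(np.mean(ratio - np.log(ratio) - 1.0))


if __name__ == "__main__":
    freqs = np.linspace(0.0, 8000.0, 513)
    reference = 1.0 + 0.5 * np.cos(2.0 * np.pi * freqs / 2000.0)
    estimate = 1.1 * reference  # a uniformly overestimated envelope
    print(itakura_saito_erb(reference, reference, freqs))
    print(itakura_saito_erb(reference, estimate, freqs))
```

Averaging on the warped axis gives low-frequency regions, where auditory frequency resolution is finer, more weight than a uniform average in Hz would.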
