Error Evaluation of an F0-Adaptive Spectral Envelope Estimator in Robustness against the Additive Noise and F0 Error

This paper describes an evaluation of a temporally stable spectral envelope estimator proposed in our previous research. That research demonstrated that the proposed algorithm can synthesize speech that is as natural as the input speech. This paper focuses on an objective comparison in which the proposed algorithm is compared with two modern estimation algorithms in terms of estimation performance and temporal stability. The results show that the proposed algorithm outperforms the other two in both respects.

Key words: speech analysis, spectral envelope, F0-adaptive analysis, time-varying component
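The abstract does not specify which error measures the objective comparison uses. As a hedged illustration only, the sketch below assumes two common choices. Estimation performance is scored as a log-spectral distance (in dB) between an estimated envelope and a known reference envelope. Temporal stability is scored as the frame-to-frame spread (in dB) of envelopes estimated from a stationary input. The function names `log_spectral_distance` and `temporal_variation` are hypothetical and do not come from the paper.

```python
import math
from statistics import pstdev

# Hypothetical measures for illustration; the paper's actual metrics may differ.


def log_spectral_distance(reference, estimate):
    """RMS difference in dB between two power spectra sampled on the same bins."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("spectra must be non-empty and the same length")
    squared = [
        (10.0 * math.log10(r / e)) ** 2
        for r, e in zip(reference, estimate)
    ]
    return math.sqrt(sum(squared) / len(squared))


def temporal_variation(envelopes):
    """Mean over frequency bins of the across-frame standard deviation in dB.

    `envelopes` is a list of frames, each a list of power values estimated
    from a stationary input. A perfectly stable estimator yields 0.0.
    """
    if len(envelopes) < 2:
        raise ValueError("need at least two frames")
    n_bins = len(envelopes[0])
    if any(len(frame) != n_bins for frame in envelopes):
        raise ValueError("all frames must have the same number of bins")
    # Convert every frame to dB before measuring the spread.
    db = [[10.0 * math.log10(p) for p in frame] for frame in envelopes]
    per_bin = [pstdev([frame[k] for frame in db]) for k in range(n_bins)]
    return sum(per_bin) / n_bins


if __name__ == "__main__":
    # An estimate that is uniformly 10 times too large in power is off by 10 dB.
    print(log_spectral_distance([1.0, 2.0, 4.0], [10.0, 20.0, 40.0]))
    # Identical frames show no temporal variation.
    print(temporal_variation([[1.0, 10.0], [1.0, 10.0]]))
```

Under these assumptions, a lower distance indicates a more accurate estimate, and a temporal variation near zero indicates that the estimate does not fluctuate with the analysis position on a stationary signal.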
