Harvest: A High-Performance Fundamental Frequency Estimator from Speech Signals

A fundamental frequency (F0) estimator named Harvest is described. The unique points of Harvest are that it can obtain a reliable F0 contour and reduce the error that the voiced section is wrongly identified as the unvoiced section. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates. In the first step, the algorithm uses fundamental component extraction by many band-pass filters with different center frequencies and obtains the basic F0 candidates from filtered signals. After that, basic F0 candidates are refined and scored by using the instantaneous frequency, and then several F0 candidates in each frame are estimated. Since the frame-by-frame processing based on the fundamental component extraction is not robust against temporally local noise, a connection algorithm using neighboring F0s is used in the second step. The connection takes advantage of the fact that the F0 contour does not precipitously change in a short interval. We carried out an evaluation using two speech databases with electroglottograph (EGG) signals to compare Harvest with several state-of-the-art algorithms. Results showed that Harvest achieved the best performance of all algorithms.

[1]  Zhizheng Wu,et al.  Merlin: An Open Source Neural Network Speech Synthesis System , 2016, SSW.

[2]  Masanori Morise,et al.  D4C, a band-aperiodicity estimator for high-quality speech synthesis , 2016, Speech Commun..

[3]  Hideki Kawahara,et al.  YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[4]  Kai Yu,et al.  Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  A. Noll Short‐Time Spectrum and “Cepstrum” Techniques for Vocal‐Pitch Detection , 1964 .

[6]  M. Ross,et al.  Average magnitude difference function pitch extractor , 1974 .

[7]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  Wolfgang Hess,et al.  Pitch Determination of Speech Signals , 1983 .

[9]  Heiga Zen,et al.  Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis , 2016, SSW.

[10]  Hideki Kawahara,et al.  TUSK: A Framework for Overviewing the Performance of F0 Estimators , 2016, INTERSPEECH.

[11]  Hideki Kawahara,et al.  Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation , 2007 .

[12]  John G Harris,et al.  A sawtooth waveform inspired pitch estimator for speech and music. , 2008, The Journal of the Acoustical Society of America.

[13]  Hideki Kawahara,et al.  v.morish'09: A Morphing-Based Singing Design Interface for Vocal Melodies , 2009, ICEC.

[14]  Bayya Yegnanarayana,et al.  Event-Based Instantaneous Fundamental Frequency Estimation From Speech Signals , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[16]  Paul C. Bagshaw,et al.  Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching , 1993, EUROSPEECH.

[17]  HIDEKI KAWAHARA,et al.  Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework , 2011 .

[18]  A. Nuttall Some windows with very good sidelobe behavior , 1981 .

[19]  Hideki Kawahara,et al.  Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT , 2005, INTERSPEECH.

[20]  Kou Tanaka,et al.  A Hybrid Approach to Electrolaryngeal Speech Enhancement Based on Noise Reduction and Statistical Excitation Generation , 2014, IEICE Trans. Inf. Syst..

[21]  Masanori Morise,et al.  Error Evaluation of an F0-Adaptive Spectral Envelope Estimator in Robustness against the Additive Noise and F0 Error , 2015, IEICE Trans. Inf. Syst..

[22]  J. L. Flanagan,et al.  PHASE VOCODER , 2008 .

[23]  T. Irino,et al.  Robust and accurate fundamental frequency estimation based on dominant harmonic components. , 2004, The Journal of the Acoustical Society of America.

[24]  Hajime Kobayashi,et al.  Weighted autocorrelation for pitch extraction of noisy speech , 2001, IEEE Trans. Speech Audio Process..

[25]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[26]  Simon Dixon,et al.  PYIN: A fundamental frequency estimator using probabilistic threshold distributions , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  M. Mathews,et al.  Pitch Synchronous Analysis of Voiced Sounds , 1961 .

[28]  Masataka Goto,et al.  A spectral envelope estimation method based on F0-adaptive multi-frame integration analysis , 2012, SAPA@INTERSPEECH.

[29]  Heiga Zen,et al.  Statistical parametric speech synthesis using deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[30]  Masanori Morise,et al.  CheapTrick, a spectral envelope estimator for high-quality speech synthesis , 2015, Speech Commun..

[31]  Aaron E. Rosenberg,et al.  A comparative performance study of several pitch detection algorithms , 1976 .

[32]  Hideki Kawahara,et al.  Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33]  A. Noll Cepstrum pitch determination. , 1967, The Journal of the Acoustical Society of America.

[34]  Hideki Kawahara,et al.  Fast and Reliable F0 Estimation Method Based on the Period Extraction of Vocal Fold Vibration of Singing Voice and Speech , 2009 .

[35]  Yuji Hisaminato,et al.  A Fast and Accurate Fundamental Frequency Estimator Using Recursive Moving Average Filters , 2016, INTERSPEECH.