TUSK: A Framework for Overviewing the Performance of F0 Estimators

This article presents a framework for overviewing the performance of fundamental frequency (F0) estimators and evaluates its effectiveness. Over the past few decades, many F0 estimators and evaluation indices have been proposed and have been evaluated using various speech databases. In speech analysis/synthesis research, modern estimators are used as the algorithm to fulfill the demand for high-quality speech synthesis, but at the same time, they are competing with one another on minor issues. Specifically, while all of them meet the demands for high-quality speech synthesis, the result depends on the speech database used in the evaluation. Since there are various types of speech, it is inadvisable to discuss the effectiveness of each estimator on the basis of minor differences. It would be better to select the appropriate F0 estimator in accordance with the speech characteristics. The framework we propose, TUSK, does not rank the estimators but rather attempts to overview them. In TUSK, six parameters are introduced to observe the trends in the characteristics in each F0 estimator. The signal is artificially generated so that six parameters can be controllable independently. In this article, we introduce the concept of TUSK and determine its effectiveness using several modern F0 estimators.

[1]  M. Ross,et al.  Average magnitude difference function pitch extractor , 1974 .

[2]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[3]  T. Irino,et al.  Robust and accurate fundamental frequency estimation based on dominant harmonic components. , 2004, The Journal of the Acoustical Society of America.

[4]  Hideki Kawahara,et al.  Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT , 2005, INTERSPEECH.

[5]  A. Noll Cepstrum pitch determination. , 1967, The Journal of the Acoustical Society of America.

[6]  Wolfgang Hess,et al.  Pitch Determination of Speech Signals , 1983 .

[7]  Takao Kobayashi,et al.  Harmonics tracking and pitch extraction based on instantaneous frequency , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[8]  Simon Dixon,et al.  PYIN: A fundamental frequency estimator using probabilistic threshold distributions , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Hideki Kawahara SparkNG: Interactive MATLAB Tools for Introduction to Speech Production, Perception and Processing Fundamentals and Application of the Aliasing-Free L-F Model Component , 2016, INTERSPEECH.

[10]  HIDEKI KAWAHARA,et al.  Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework , 2011 .

[11]  Hajime Kobayashi,et al.  Weighted autocorrelation for pitch extraction of noisy speech , 2001, IEEE Trans. Speech Audio Process..

[12]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[13]  Masataka Goto,et al.  A spectral envelope estimation method based on F0-adaptive multi-frame integration analysis , 2012, SAPA@INTERSPEECH.

[14]  Masanori Morise,et al.  CheapTrick, a spectral envelope estimator for high-quality speech synthesis , 2015, Speech Commun..

[15]  John G Harris,et al.  A sawtooth waveform inspired pitch estimator for speech and music. , 2008, The Journal of the Acoustical Society of America.

[16]  Masanori Morise,et al.  Error Evaluation of an F0-Adaptive Spectral Envelope Estimator in Robustness against the Additive Noise and F0 Error , 2015, IEICE Trans. Inf. Syst..

[17]  Aaron E. Rosenberg,et al.  A comparative performance study of several pitch detection algorithms , 1976 .

[18]  D. Klatt,et al.  Analysis, synthesis, and perception of voice quality variations among female and male talkers. , 1990, The Journal of the Acoustical Society of America.

[19]  Hideki Kawahara,et al.  YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[20]  A. Noll Short‐Time Spectrum and “Cepstrum” Techniques for Vocal‐Pitch Detection , 1964 .

[21]  Hideki Kawahara,et al.  Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Bayya Yegnanarayana,et al.  Event-Based Instantaneous Fundamental Frequency Estimation From Speech Signals , 2009, IEEE Transactions on Audio, Speech, and Language Processing.