Yet Another Algorithm for Pitch Tracking

In this paper, we present a pitch detection algorithm that is extremely robust for both high-quality and telephone speech. The kernel method for this algorithm is the Normalized Cross Correlation Function (NCCF) reported by David Talkin [1]. Major innovations include: processing of both the original acoustic signal and a nonlinearly processed version of the signal to partially restore very weak F0 components; intelligent peak picking to select multiple F0 candidates and assign merit factors; and incorporation of highly robust pitch contours obtained from smoothed versions of low-frequency portions of spectrograms. Dynamic programming is used to find the “best” pitch track among all the candidates, using both local and transition costs. We evaluated our algorithm using the Keele pitch extraction reference database as “ground truth” for both “high quality” and “telephone” speech. For both types of speech, the error rates obtained are lower than the lowest reported in the literature.
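As a rough illustration of two of the building blocks named above, the following Python sketch computes an NCCF over a range of lags and then picks a pitch track from per-frame candidates with dynamic programming. It is a minimal sketch under stated assumptions, not the algorithm evaluated in the paper: the function names (nccf, best_track), the local cost (1 - merit), the transition cost (absolute log2 ratio of successive F0 values), and the transition_weight parameter are all hypothetical choices made for this example. The nonlinear processing, peak picking, and spectrogram-based contours are not shown.

```python
import math


def nccf(frame, window, max_lag):
    """Normalized cross-correlation of frame[0:window] with lagged copies.

    Returns one value in [-1, 1] for each lag in 0..max_lag.
    """
    if len(frame) < window + max_lag:
        raise ValueError("frame must hold at least window + max_lag samples")
    ref = frame[:window]
    e0 = sum(x * x for x in ref)  # energy of the reference window
    values = []
    for lag in range(max_lag + 1):
        seg = frame[lag:lag + window]
        ek = sum(x * x for x in seg)  # energy of the lagged window
        num = sum(a * b for a, b in zip(ref, seg))
        denom = math.sqrt(e0 * ek)
        values.append(num / denom if denom > 0 else 0.0)
    return values


def best_track(candidates, transition_weight=1.0):
    """Pick one F0 per frame by dynamic programming.

    candidates: list of frames, each a non-empty list of (f0_hz, merit)
    pairs with merit in [0, 1].
    Assumed costs (illustrative only): local = 1 - merit,
    transition = transition_weight * |log2(f0 / previous_f0)|.
    """
    costs = [[1.0 - merit for _, merit in candidates[0]]]
    back = []
    for t in range(1, len(candidates)):
        row, ptr = [], []
        for f0, merit in candidates[t]:
            options = [
                costs[-1][j] + transition_weight * abs(math.log2(f0 / prev_f0))
                for j, (prev_f0, _) in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row.append(options[j_best] + (1.0 - merit))
            ptr.append(j_best)
        costs.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    i = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    path = [i]
    for ptr in reversed(back):
        i = ptr[i]
        path.append(i)
    path.reverse()
    return [candidates[t][i][0] for t, i in enumerate(path)]
```

In this sketch, a frame's F0 candidates would come from peaks of the NCCF (lag k at sampling rate fs corresponds to F0 = fs / k), with the peak height serving as a simple stand-in for a merit factor. The transition cost penalizes octave jumps between neighboring frames, so the selected track can differ from the sequence of locally best candidates.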

[1]  Alex Acero, et al. Maximum a posteriori pitch tracking, 1998, ICSLP.

[2]  Aaron E. Rosenberg, et al. A comparative performance study of several pitch detection algorithms, 1976.

[3]  Stephanie Seneff, et al. Robust pitch tracking for prosodic modeling in telephone speech, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[4]  Paul Christopher Bagshaw, et al. Automatic prosodic analysis for computer aided pronunciation teaching, 1994.

[5]  Nick Roussopoulos, et al. MOCHA: a self-extensible database middleware system for distributed data sources, 2000, SIGMOD 2000.

[6]  Ronald J. Baken, et al. Clinical measurement of speech and voice, 1987.

[7]  Fabrice Plante, et al. A pitch extraction reference database, 1995, EUROSPEECH.

[8]  Wolfgang Hess, et al. Pitch Determination of Speech Signals, 1983.

[9]  John Laver, et al. Aspects of speech technology, 1988.

[10]  Stephen A. Zahorian, et al. Personal computer software vowel training aid for the hearing impaired, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[11]  Chao Huang, et al. Large vocabulary Mandarin speech recognition with different approaches in modeling tones, 2000, INTERSPEECH.

[12]  Stephanie Seneff, et al. A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition, 1998, ICSLP.

[13]  Paul C. Bagshaw, et al. Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching, 1993, EUROSPEECH.

[14]  Mari Ostendorf, et al. A Multi-level Model for Recognition of Intonation Labels, 1997, Computing Prosody.

[15]  José A. R. Fonollosa, et al. A comparison of several recent methods of fundamental frequency and voicing decision estimation, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[16]  S. Knorr. Reliable voiced/unvoiced decision, 1979.

[17]  Alan V. Oppenheim, et al. Discrete representation of signals, 1972.

[18]  Philip Lieberman, et al. Speech Physiology, Speech Perception, and Acoustic Phonetics, 1988.

[19]  Ronald W. Schafer, et al. Digital Processing of Speech Signals, 1978.

[20]  Piero Cosi, et al. On the use of autocorrelation for pitch extraction: Some statistical considerations and their application to the sift algorithm, 1984, Speech Commun.

[21]  Sara H. Basson, et al. NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database, 1990, International Conference on Acoustics, Speech, and Signal Processing.