Robust pitch tracking for prosodic modeling in telephone speech

In this paper, we introduce a pitch detection algorithm that is particularly robust for telephone speech and prosodic modeling. The algorithm uses a logarithmically sampled spectral representation of speech, similar to that in the subharmonic summation approach. Constraints for log F0 and Δlog F0 are combined in a dynamic programming search to find an optimum pitch track. The search algorithm is able to find a continuous pitch contour regardless of the voicing status, while a separate voicing decision module computes the probability of voicing per frame. We evaluated the algorithm using the Keele pitch extraction reference database under both studio and telephone conditions. Our algorithm is very robust to channel degradation, and compares favorably to XWAVES under telephone conditions. It also significantly outperforms XWAVES when used for tone classification on a telephone quality Mandarin digit corpus.
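
The dynamic programming search described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `track_pitch`, the per-frame candidate scores, the absolute-difference penalty on Δlog F0, and the `delta_weight` parameter are all assumptions chosen for illustration. It only shows how a per-frame log F0 score and a frame-to-frame Δlog F0 constraint might be combined in a Viterbi-style search that returns one pitch value for every frame.

```python
import math


def track_pitch(frame_scores, log_f0_grid, delta_weight=1.0):
    """Return the best log-F0 path over all frames (illustrative sketch).

    frame_scores[t][k]: score of candidate k in frame t (higher is better);
        assumed to come from some spectral front end, not computed here.
    log_f0_grid[k]: log F0 value of candidate k.
    delta_weight: hypothetical weight of the |delta log F0| penalty.
    """
    n_cands = len(log_f0_grid)
    best = list(frame_scores[0])  # accumulated score per candidate
    back = []                     # backpointers, one list per later frame

    for scores in frame_scores[1:]:
        new_best, ptr = [], []
        for k in range(n_cands):
            # Pick the predecessor that maximizes the accumulated score
            # minus a penalty proportional to the jump in log F0.
            def total(j, k=k):
                jump = abs(log_f0_grid[k] - log_f0_grid[j])
                return best[j] - delta_weight * jump

            j_best = max(range(n_cands), key=total)
            new_best.append(total(j_best) + scores[k])
            ptr.append(j_best)
        best = new_best
        back.append(ptr)

    # Trace back from the best final candidate to recover the full path.
    k = max(range(n_cands), key=lambda i: best[i])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [log_f0_grid[i] for i in path]


# Toy input: the middle frame slightly prefers the octave above (200 Hz).
grid = [math.log(100.0), math.log(200.0), math.log(400.0)]
scores = [[1.0, 0.2, 0.0], [0.5, 0.6, 0.0], [1.0, 0.2, 0.0]]
smooth_path = track_pitch(scores, grid, delta_weight=1.0)
greedy_path = track_pitch(scores, grid, delta_weight=0.0)
```

In this toy input, the continuity penalty keeps `smooth_path` on the 100 Hz candidate through the ambiguous middle frame, whereas `greedy_path` (no penalty) jumps an octave. Because the search always returns a value per frame, a voicing decision would have to be made separately, as the abstract describes.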
