Generalized Likelihood Ratio Test for Voiced-Unvoiced Decision in Noisy Speech Using the Harmonic Model

This paper presents a novel method for voiced-unvoiced decision within a pitch tracking algorithm. A voiced-unvoiced decision is required in many applications, including modeling for analysis/synthesis, detection of model changes for segmentation, and signal characterization for indexing and recognition. The proposed method is based on the generalized likelihood ratio test (GLRT) and assumes colored Gaussian noise with unknown covariance. Under the voiced hypothesis, a harmonic plus noise model is assumed. The derived method is combined with a maximum a posteriori probability (MAP) scheme to obtain a pitch and voicing tracking algorithm. The performance of the proposed method is tested on several speech databases at different levels of additive noise and under telephone speech conditions. Results show that the GLRT is robust to speaker and environmental conditions and performs better than existing algorithms.
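To make the idea of a GLRT voicing decision concrete, the following is a minimal sketch and not the method derived in the paper. It assumes white Gaussian noise with unknown variance (the paper assumes colored noise with unknown covariance), omits the MAP tracking stage, and searches a fixed grid of candidate pitches. The function names, the number of harmonics, the pitch grid, and the threshold value are all hypothetical choices made for illustration. Under these assumptions, the log-likelihood ratio reduces to N ln(E0/E1), where E0 is the frame energy and E1 is the residual energy after a least-squares fit of the harmonic model.

```python
import numpy as np

def harmonic_basis(n_samples, f0, fs, n_harmonics):
    """Cosine and sine columns at the first n_harmonics multiples of f0."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, n_harmonics + 1)
    phase = 2.0 * np.pi * f0 * np.outer(t, k)
    return np.hstack([np.cos(phase), np.sin(phase)])

def glrt_statistic(frame, f0, fs, n_harmonics):
    """2 ln(GLR) for 'harmonics + white noise' vs 'white noise only'.

    Assumption: white Gaussian noise with unknown variance, so the
    statistic is N * ln(E0 / E1) with E1 the least-squares residual energy.
    """
    basis = harmonic_basis(len(frame), f0, fs, n_harmonics)
    coef, *_ = np.linalg.lstsq(basis, frame, rcond=None)
    residual = frame - basis @ coef
    e0 = np.dot(frame, frame)
    e1 = np.dot(residual, residual)
    return len(frame) * np.log(e0 / e1)

def voiced_decision(frame, fs, f0_grid, n_harmonics, threshold):
    """Maximize the statistic over candidate pitches, then threshold it."""
    stats = np.array([glrt_statistic(frame, f0, fs, n_harmonics)
                      for f0 in f0_grid])
    best = int(np.argmax(stats))
    return bool(stats[best] > threshold), float(f0_grid[best]), float(stats[best])

# Synthetic 30 ms frames at 8 kHz (illustrative data, not from the paper).
fs = 8000.0
n = 240
t = np.arange(n) / fs
rng = np.random.default_rng(0)
harmonics = sum(np.cos(2.0 * np.pi * k * 120.0 * t + 0.3 * k) / k
                for k in range(1, 6))
voiced_frame = harmonics + 0.3 * rng.standard_normal(n)
noise_frame = 0.3 * rng.standard_normal(n)
f0_grid = np.arange(60.0, 401.0, 5.0)  # candidate pitches in Hz

print(voiced_decision(voiced_frame, fs, f0_grid, 5, 60.0))
print(voiced_decision(noise_frame, fs, f0_grid, 5, 60.0))
```

The threshold of 60 is an arbitrary value chosen for this synthetic example. In practice it would be set from a target false-alarm rate, and handling colored noise would require replacing the plain least-squares residual with one that accounts for the estimated noise covariance.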
