Event-Based Method for Instantaneous Fundamental Frequency Estimation from Voiced Speech Based on Eigenvalue Decomposition of the Hankel Matrix

We propose a robust event-based method for estimating the instantaneous fundamental frequency of a voiced speech signal. The amplitude- and frequency-modulated (AM-FM) signal model of voiced speech in the low frequency range (LFR) indicates that energy is present only around the instantaneous fundamental frequency (F0) and its first few harmonics. The time-varying F0 component of a voiced speech signal is extracted by a robust algorithm that iteratively performs eigenvalue decomposition (EVD) of the Hankel matrix, which is initially constructed from samples of the LFR-filtered voiced speech signal. The negative cycles of the extracted time-varying F0 component provide a reliable coarse estimate of the intervals in which glottal closure instants (GCIs) may be present. The negative cycles of the LFR-filtered voiced speech signal that occur within these intervals are isolated. There is a sudden decrease in the glottal impedance at GCIs, resulting in high signal strength. GCIs are therefore detected as local minima in the derivative of the falling edges of the isolated negative cycles of the LFR-filtered voiced speech signal, followed by a selection criterion that discards false GCI candidates. The instantaneous F0 is estimated as the inverse of the time interval between two consecutive GCIs. Experiments were performed on the Keele and CSTR speech databases in white and babble noise environments at various levels of degradation to assess the performance of the proposed method. The proposed method substantially reduces gross F0 estimation errors in comparison to some state-of-the-art methods.
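The two core ideas above, isolating the F0 component through EVD of a Hankel matrix and taking F0 as the inverse of the interval between consecutive events, can be illustrated with a minimal Python sketch. It is not the authors' implementation: it performs a single EVD pass rather than the iterative procedure, and it uses local minima of the extracted component as a simple stand-in for the GCI detector. The function names (`hankel_evd_component`, `negative_peak_times`, `f0_from_events`) and the choice of a square Hankel matrix with anti-diagonal averaging are assumptions made for illustration.

```python
import numpy as np

def hankel_evd_component(x, n_pairs=1):
    """Single-pass sketch: keep the n_pairs most significant eigenvalue
    pairs of the Hankel matrix of x and map the result back to a signal.
    (Assumption: one pass only, not the paper's iterative algorithm.)"""
    x = np.asarray(x, dtype=float)
    if len(x) % 2 == 0:              # use an odd length so the matrix is square
        x = x[:-1]
    n = len(x)
    k = (n + 1) // 2
    # Square symmetric Hankel matrix: H[i, j] = x[i + j]
    idx = np.arange(k)[:, None] + np.arange(k)[None, :]
    H = x[idx]
    w, V = np.linalg.eigh(H)         # real eigenvalues, orthonormal eigenvectors
    # A real sinusoid contributes one positive and one negative eigenvalue
    keep = np.argsort(np.abs(w))[::-1][:2 * n_pairs]
    H_sel = (V[:, keep] * w[keep]) @ V[:, keep].T
    # Average along anti-diagonals to recover a time series of length n
    sums = np.bincount(idx.ravel(), weights=H_sel.ravel(), minlength=n)
    counts = np.bincount(idx.ravel(), minlength=n)
    return sums / counts

def negative_peak_times(c, fs):
    """Times (s) of local minima in the negative cycles of c.
    (Assumption: a simplified stand-in for GCI detection.)"""
    c = np.asarray(c, dtype=float)
    mid = c[1:-1]
    is_min = (mid < c[:-2]) & (mid <= c[2:]) & (mid < 0)
    return (np.nonzero(is_min)[0] + 1) / fs

def f0_from_events(event_times):
    """Instantaneous F0 as the inverse of the interval between consecutive events."""
    return 1.0 / np.diff(np.asarray(event_times, dtype=float))

# Synthetic example: 120 Hz fundamental plus a weaker second harmonic
fs = 8000
t = np.arange(401) / fs
signal = np.sin(2 * np.pi * 120 * t) + 0.4 * np.sin(2 * np.pi * 240 * t)
component = hankel_evd_component(signal)
f0_track = f0_from_events(negative_peak_times(component, fs))
```

With this synthetic input, `f0_track` holds one F0 value per pair of consecutive minima. Its resolution is limited by the sampling period, since the events here are located only to the nearest sample.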
