Robust fundamental frequency estimation in sustained vowels: detailed algorithmic comparisons and information fusion with adaptive Kalman filtering.

Speech signal processing researchers have long been interested in accurately estimating the fundamental frequency (F0) of speech signals. This study examines ten F0 estimation algorithms (some well established, some proposed more recently) to determine which is, on average, best able to estimate F0 in the sustained vowel /a/. It also proposes a robust method, built on an adaptive Kalman filter (KF) framework, for adaptively weighting the estimates of the individual F0 estimation algorithms according to quality and performance measures. The accuracy of the algorithms is validated using (a) a database of 117 realistic synthetic phonations generated with a sophisticated physiological model of speech production and (b) a database of 65 recordings of human phonations in which the glottal cycles are calculated from electroglottograph signals. On average, the sawtooth waveform inspired pitch estimator and the nearly defect-free algorithms provided the best individual F0 estimates, and the proposed KF approach improved accuracy by approximately 16% over the best single F0 estimation algorithm. These findings may be useful in speech signal processing applications that use sustained vowels to assess vocal quality and therefore require very accurate F0 estimation.
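The idea of weighting several F0 estimates by their reliability can be illustrated with a minimal sketch. This is not the adaptive KF proposed in the study: it is a plain scalar Kalman filter with a random-walk state model, and the function name `fuse_f0_tracks`, the `process_var` parameter, and the per-frame noise variances are all hypothetical choices made for illustration. The sketch assumes each estimator supplies, for every frame, an F0 value in Hz and a measurement-noise variance that some quality measure has already produced.

```python
from typing import List, Sequence


def fuse_f0_tracks(
    tracks: Sequence[Sequence[float]],
    noise_vars: Sequence[Sequence[float]],
    process_var: float = 1.0,
) -> List[float]:
    """Fuse several F0 tracks (Hz) frame by frame with a scalar Kalman filter.

    tracks[k][t] is estimator k's F0 at frame t. noise_vars[k][t] is the
    (assumed, hypothetical) measurement-noise variance for that estimate;
    a larger variance means the estimate is trusted less.
    """
    n_frames = len(tracks[0])
    x = tracks[0][0]  # initial state: first estimate of the first track
    p = 1e6           # large initial variance, so early data dominates
    fused = []
    for t in range(n_frames):
        # Predict: random-walk model, F0 is assumed to drift slowly.
        p = p + process_var
        # Update: absorb each estimator's measurement in turn.
        for track, nv in zip(tracks, noise_vars):
            gain = p / (p + nv[t])
            x = x + gain * (track[t] - x)
            p = (1.0 - gain) * p
        fused.append(x)
    return fused


if __name__ == "__main__":
    a = [120.0] * 5
    b = [120.0] * 5
    c = [120.0, 120.0, 240.0, 120.0, 120.0]  # octave error at frame 2
    variances = [[1.0] * 5, [1.0] * 5, [1.0, 1.0, 1e4, 1.0, 1.0]]
    print(fuse_f0_tracks([a, b, c], variances))
```

In this sketch the octave error in the third track barely moves the fused value, because its frame is flagged with a very large noise variance. An adaptive scheme would have to derive those variances from the signal itself rather than take them as given.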
