A Fast Method for High-Resolution Voiced/Unvoiced Detection and Glottal Closure/Opening Instant Estimation of Speech

We propose a fast speech analysis method that simultaneously performs high-resolution voiced/unvoiced detection (VUD) and accurate estimation of glottal closure and glottal opening instants (GCIs and GOIs, respectively). The proposed algorithm exploits the structure of the glottal flow derivative to estimate GCIs and GOIs only in voiced speech, using simple time-domain criteria. We compare our method with three well-known GCI/GOI methods: the dynamic programming projected phase-slope algorithm (DYPSA), yet another GCI/GOI algorithm (YAGA), and speech event detection using the residual excitation and a mean-based signal (SEDREAMS). Furthermore, we examine the performance of these methods when combined with two state-of-the-art VUD algorithms: the robust algorithm for pitch tracking (RAPT) and the summation of residual harmonics (SRH). Experiments conducted on the APLAWD and SAM databases show that, for clean speech, the proposed algorithm outperforms the state-of-the-art combinations of VUD and GCI/GOI algorithms with respect to almost all evaluation criteria. Experiments on speech contaminated with several noise types (white Gaussian, babble, and car-interior) are also presented and discussed. For signal-to-noise ratios above 10 dB, the proposed algorithm outperforms the state-of-the-art combinations in most evaluation criteria.
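The noisy-speech experiments rely on contaminating clean speech at a controlled signal-to-noise ratio. The abstract does not describe how the mixing was done, so the following is only a minimal sketch of one common approach: scale the noise so that the ratio of average speech power to average noise power matches a target SNR in dB. The function names `mix_at_snr` and `measured_snr_db`, and the toy sine and Gaussian signals, are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` so the mixture has the target SNR (dB).

    Hypothetical helper: SNR is defined here over the whole signal as
    10*log10(P_speech / P_noise), with P the mean squared amplitude.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve snr_db = 10*log10(p_speech / (gain**2 * p_noise)) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR (dB) of `noisy`, given the clean `speech` it was built from."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_resid = sum((y - s) ** 2 for s, y in zip(speech, noisy)) / len(speech)
    return 10.0 * math.log10(p_speech / p_resid)


if __name__ == "__main__":
    rng = random.Random(0)
    fs = 8000  # assumed sampling rate for the toy signal, in Hz
    # Toy stand-in for voiced speech: a 120 Hz sinusoid, one second long.
    speech = [math.sin(2.0 * math.pi * 120.0 * n / fs) for n in range(fs)]
    # White Gaussian noise, one of the noise types named in the abstract.
    noise = [rng.gauss(0.0, 1.0) for _ in range(fs)]
    noisy = mix_at_snr(speech, noise, snr_db=10.0)
    print(round(measured_snr_db(speech, noisy), 6))
```

Defining SNR over the whole utterance is a simplification; evaluations on speech often measure the speech level over active segments only, which would change the gain.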
