On the detection of pitch marks using a robust multi-phase algorithm

A large number of methods for identifying glottal closure instants (GCIs) in voiced speech have been proposed in recent years. In this paper, we propose to take advantage of both glottal and speech signals in order to increase the accuracy of GCI detection. All aspects of this issue are addressed, from determining speech polarity to handling the delay between the glottal signal and the corresponding speech signal. A robust multi-phase algorithm (MPA), which combines different methods applied to both signals in a unique way, is presented. Within the process, special attention is paid to determining the polarity of the speech waveform, as polarity was found to considerably influence the performance of the detection algorithms. Another feature of the proposed method is that every detected GCI is given a confidence score, which makes it possible to locate potentially inaccurate GCI subsequences. The performance of the proposed algorithm was tested and compared with other freely available GCI detection algorithms. The MPA was found to be more robust in terms of detection accuracy across various sets of sentences, languages and phone classes. Finally, some pitfalls of GCI detection are discussed.
