Epoch-based analysis of speech signals

Speech analysis is traditionally performed using short-time analysis to extract features in the time and frequency domains. The window size for the analysis is fixed somewhat arbitrarily, mainly to account for the time-varying vocal tract system during production. However, in its primary mode, speech is produced by an impulse-like excitation in each glottal cycle. Anchoring the analysis around the glottal closure instants (epochs) therefore yields significant benefits. Epoch-based analysis helps not only to segment speech signals according to speech production characteristics, but also to analyze them more accurately. It enables extraction of important acoustic-phonetic features such as glottal vibrations, formants, and instantaneous fundamental frequency. The epoch sequence is useful for manipulating prosody in speech synthesis applications. Accurate estimation of epochs helps in characterizing voice quality features. Epoch extraction also helps in speech enhancement and multispeaker separation. In this tutorial article, the importance of epochs for speech analysis is discussed, and methods to extract the epoch information are reviewed. The use of epoch extraction in several speech applications is demonstrated.
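As a minimal illustration of how an epoch sequence gives access to instantaneous fundamental frequency, the sketch below takes the reciprocal of the interval between successive epochs. The function name `instantaneous_f0`, the representation of epochs as sample indices, and the 8 kHz sampling rate in the usage lines are assumptions made for this example and are not taken from the article; the epoch extraction step itself is not shown.

```python
from typing import List, Sequence


def instantaneous_f0(epoch_samples: Sequence[int], fs: float) -> List[float]:
    """Return one F0 estimate (Hz) for each pair of successive epochs.

    epoch_samples: epoch locations as strictly increasing sample indices
                   (hypothetical input; assumed to come from some epoch extractor).
    fs:            sampling rate in Hz.
    """
    f0 = []
    for prev, curr in zip(epoch_samples, epoch_samples[1:]):
        interval = curr - prev  # length of one glottal cycle, in samples
        if interval <= 0:
            raise ValueError("epoch locations must be strictly increasing")
        # F0 is the reciprocal of the cycle duration (interval / fs seconds).
        f0.append(fs / interval)
    return f0


# Usage: epochs spaced 80 samples apart at an assumed 8 kHz sampling rate.
epochs = [0, 80, 160, 240]
f0_values = instantaneous_f0(epochs, fs=8000.0)
```

Because each value comes from a single glottal cycle rather than a fixed analysis window, the estimate follows cycle-to-cycle changes in pitch. In practice the intervals would be restricted to voiced regions, which this sketch does not attempt.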
