Improved automatic detection of creak

This paper describes a new algorithm for automatically detecting creak in speech signals. Detection relies on two new acoustic parameters designed to characterise creaky excitations, drawing on previous evidence in the literature combined with new insights from observations in the current work. In particular, the new method focuses on features of the Linear Prediction (LP) residual signal, including the presence of secondary peaks as well as prominent impulse-like excitation peaks. These parameters are used as input features to a decision tree classifier for identifying creaky regions. The algorithm was evaluated on a range of read and conversational speech databases and was shown to clearly outperform the state of the art. Further experiments involving degradations of the speech signal demonstrated robustness to both white and babble noise, providing better results than the state of the art down to at least 20 dB signal-to-noise ratio.
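To make the notion of an LP residual with impulse-like peaks concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes the autocorrelation method of linear prediction with an arbitrarily chosen order of 12. The functions `lp_residual` and `peak_prominence` are hypothetical names introduced here for illustration, and the max-to-median ratio is a stand-in measure, not one of the two acoustic parameters described above.

```python
import numpy as np


def lp_residual(frame, order=12):
    """Return the LP residual of one frame (autocorrelation method).

    Assumption: order=12 is an illustrative choice, not a value from the paper.
    """
    x = np.asarray(frame, dtype=float)
    w = x * np.hanning(len(x))  # taper the frame before estimating the predictor
    # Autocorrelation at lags 0..order
    r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
    r[0] *= 1.0 + 1e-6  # slight regularisation to keep the system well conditioned
    # Normal equations R a = r[1:], with R the Toeplitz autocorrelation matrix
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    # Inverse filter A(z) = 1 - sum_k a_k z^{-k}, applied to the untapered frame
    inverse_filter = np.concatenate(([1.0], -a))
    return np.convolve(x, inverse_filter)[: len(x)]


def peak_prominence(residual):
    """Hypothetical measure: largest absolute residual sample over the median magnitude."""
    mag = np.abs(residual)
    return float(mag.max() / (np.median(mag) + 1e-12))
```

Under these assumptions, a frame excited by sparse, strong pulses (as might be expected in creak) should give a much larger ratio than a noise-excited frame. In a full system, frame-level measures of this kind would be passed to a classifier such as a decision tree.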
