The Analysis of Voice Quality in Speech Processing

Voice quality has been defined as the characteristic auditory colouring of an individual's voice, derived from a variety of laryngeal and supralaryngeal features and running continuously through the individual's speech. The distinctive tone of the speech sounds produced by a particular person is what makes that person's voice recognisable. Voice quality is at the centre of several speech processing issues. In speech recognition, voice differences, particularly extreme divergences from the norm, are responsible for known performance degradations. In speech synthesis, on the other hand, voice quality is a desirable modelling parameter, with, in theory, millions of distinguishable voice types. This article reviews the experimental derivation of voice quality markers. Specifically, it examines the use of perceptual judgements, the long-term averaged spectrum (LTAS) and prosodic markers, as well as inverse filtering for the extraction of the glottal source waveform. The review suggests that voice quality is best investigated as a multi-dimensional parameter space that combines individual prosody, temporally structured speech characteristics, spectral divergence and voice source features, and that it could profitably complement the simple linguistic prosodic models used in speech synthesis.
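As an illustration of one of the markers named above, the following is a minimal sketch of a long-term averaged spectrum, assuming the common definition of LTAS as the power spectrum averaged over many short windowed frames of a recording. The function name `ltas_db`, the frame length, the hop size and the Hann window are illustrative assumptions, not settings taken from the article.

```python
import numpy as np

def ltas_db(signal, sample_rate, frame_len=1024, hop=512):
    """Return (frequencies in Hz, LTAS in dB) for a mono signal.

    Assumed definition: mean power spectrum over overlapping
    Hann-windowed frames. Parameter values are illustrative only.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    power = np.zeros(frame_len // 2 + 1)
    for i in range(n_frames):
        start = i * hop
        frame = signal[start:start + frame_len] * window
        power += np.abs(np.fft.rfft(frame)) ** 2   # per-frame power spectrum
    power /= n_frames                               # long-term average
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    # Small offset avoids log(0) in silent bands.
    return freqs, 10.0 * np.log10(power + 1e-12)

# Synthetic check: a pure 1 kHz tone should give an LTAS peak near 1 kHz.
sr = 16000
t = np.arange(sr) / sr
freqs, spectrum = ltas_db(np.sin(2 * np.pi * 1000.0 * t), sr)
peak_hz = freqs[np.argmax(spectrum)]
```

On real speech, the averaging smooths out phonetic variation, so the remaining spectral shape is usually read as a speaker-level characteristic; the synthetic tone here only checks that the averaging behaves as expected.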
