Intelligibility Rating with Automatic Speech Recognition, Prosodic, and Cepstral Evaluation

For voice rehabilitation, speech intelligibility is an important criterion. Automatic intelligibility evaluation based on automatic speech recognition combined with prosodic analysis has been shown to be successful. In this paper, this method is extended with measures based on the Cepstral Peak Prominence (CPP). 73 hoarse patients (48.3±16.8 years old) uttered the vowel /e/ and read the German version of the text "The North Wind and the Sun". Their intelligibility was rated perceptually by five speech therapists and physicians on a 5-point scale. Support Vector Regression (SVR) revealed a feature set with a human-machine correlation of r = 0.85, consisting of the word accuracy, the smoothed CPP computed from a speech section, and three prosodic features (normalized energy of word-pause-word intervals, the F0 value at voice offset in a word, and the standard deviation of jitter). The average human-human correlation was r = 0.82. Hence, the automatic method can serve as a meaningful objective support for perceptual analysis.
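The cepstral measure underlying the extension above can be illustrated with a short sketch. The following computes an unsmoothed Cepstral Peak Prominence for a single frame, in the spirit of Hillenbrand's formulation: the amplitude of the cepstral peak within a plausible pitch-period range, measured relative to a linear regression line fit over the cepstrum. The frame length, pitch range, and regression range here are illustrative assumptions, and the paper itself uses a smoothed variant (CPPS) that additionally averages over time and quefrency, which is not reproduced here.

```python
import numpy as np

def cpp(frame, sr, f0_min=60.0, f0_max=300.0):
    """Unsmoothed Cepstral Peak Prominence (in dB-like units) of one frame."""
    # Real cepstrum: inverse FFT of the log magnitude spectrum (in dB).
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_spec = 20.0 * np.log10(spectrum + 1e-12)
    cep = np.fft.irfft(log_spec)
    quef = np.arange(len(cep)) / sr          # quefrency axis in seconds
    # Look for the cepstral peak inside the plausible pitch-period range.
    lo, hi = int(sr / f0_max), int(sr / f0_min)
    peak_idx = lo + int(np.argmax(cep[lo:hi]))
    # Linear regression of cepstral amplitude over quefrency (range assumed).
    slope, intercept = np.polyfit(quef[1:hi], cep[1:hi], 1)
    # CPP: peak amplitude above the regression line at the peak quefrency.
    return cep[peak_idx] - (slope * quef[peak_idx] + intercept)

sr = 16000
t = np.arange(1024) / sr
rng = np.random.default_rng(1)
voiced = np.sign(np.sin(2 * np.pi * 120.0 * t))  # crude 120 Hz pulse train
noise = rng.standard_normal(1024)

print(f"CPP voiced: {cpp(voiced, sr):.1f}, CPP noise: {cpp(noise, sr):.1f}")
```

A strongly periodic frame shows a far more prominent cepstral peak than a noise frame, which is the rationale for using CPP-based measures as correlates of hoarseness and, as in this paper, intelligibility.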
