Automatic intelligibility assessment of pathologic speech over the telephone

Abstract: Objective assessment of intelligibility over the telephone is desirable for voice and speech assessment and rehabilitation. A total of 82 patients who had undergone partial laryngectomy read a standardized text, which was recorded synchronously by a headset and via telephone. Five experienced raters assessed intelligibility perceptually on a five-point scale. Objective evaluation was performed by support vector regression on the word accuracy (WA) and word correctness (WR) of a speech recognition system and on a set of prosodic features. WA and WR alone exhibited correlations with the human evaluation between |r| = 0.57 and |r| = 0.75. When prosodic features and WR were combined, the correlation was r = 0.79 for headset and r = 0.86 for telephone recordings. The same feature subset was optimal for both signal qualities. It consists of WR, the average duration of the silent pauses before a word, the standard deviation of the fundamental frequency over the entire sample, the standard deviation of jitter, and the ratio of the duration of the voiced sections to the duration of the entire recording.
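The abstract does not spell out how WA and WR are computed. As a minimal sketch, assuming the definitions commonly used in speech recognition (WR counts only substitutions and deletions against the reference text, while WA additionally penalizes insertions), the two measures could be derived from a minimum-edit alignment as follows. The function names `align_counts` and `word_scores` and the example sentences are hypothetical and serve only as illustration; they are not taken from the study.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit
    alignment between a reference and a recognized word sequence."""
    n, m = len(ref), len(hyp)
    # cell[i][j] = (errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, k)          # match, no cost
            else:
                diag = (e + 1, s + 1, d, k)  # substitution
            e, s, d, k = cell[i - 1][j]
            dele = (e + 1, s, d + 1, k)      # reference word missing
            e, s, d, k = cell[i][j - 1]
            inse = (e + 1, s, d, k + 1)      # extra recognized word
            cell[i][j] = min(diag, dele, inse, key=lambda t: t[0])
    return cell[n][m][1:]


def word_scores(ref, hyp):
    """Return (WA, WR) in percent under the assumed standard definitions."""
    subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    wr = 100.0 * (n - subs - dels) / n        # word correctness
    wa = 100.0 * (n - subs - dels - ins) / n  # word accuracy
    return wa, wr


if __name__ == "__main__":
    # Made-up reference and recognizer output, for illustration only.
    reference = "the north wind and the sun".split()
    recognized = "the north win and and the sun".split()
    wa, wr = word_scores(reference, recognized)
    print(f"WA = {wa:.1f} %, WR = {wr:.1f} %")
```

Under these definitions WA can become negative when the recognizer inserts many words, whereas WR stays between 0 and 100.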
