Disordered Speech Assessment Using Kullback-Leibler Divergence Features with Multi-Task Acoustic Modeling

For the acoustical assessment of pathological speech, naturally spoken sentences are considered most suitable from the perspectives of both patients and clinicians. This is a challenging problem, as the extraction of pathology-dependent features is not straightforward. Previous research showed that features derived from lattice posteriors and decoding results of automatic speech recognition (ASR) can be used to quantify various types of speech impairment. This paper describes a novel feature derived from phone posterior probabilities generated by an ASR system. The Kullback-Leibler (KL) divergence is used to measure the phone-level distortion between unimpaired and impaired speakers. A Cantonese ASR system is trained on a combination of normal and impaired speech corpora, with multi-task learning applied to incorporate the different speech characteristics. Experimental results show that the proposed KL divergence feature is effective in continuous-speech-based assessment of different pathologies, including voice disorder and post-stroke aphasia. The KL divergence feature is found to outperform conventional acoustic features and supra-segmental duration features, and is complementary to text features in quantifying language impairment.

Index Terms: disordered speech assessment, voice disorders, aphasia, continuous speech, KL divergence, ASR, multi-task learning
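To illustrate the core measurement described above, the following is a minimal sketch of computing the KL divergence between two phone posterior distributions. It assumes (hypothetically) that each distribution has already been obtained by averaging the ASR system's frame-level phone posteriors for a given phone, with the unimpaired-speaker distribution as the reference; the function name, smoothing constant, and example values are illustrative, not taken from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL divergence D(p || q) between two discrete distributions.

    A small epsilon is added before renormalizing to avoid log(0)
    when a phone class has zero posterior mass.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical averaged phone posteriors over a 4-phone inventory:
# 'ref' from unimpaired speakers, 'test' from an impaired speaker.
ref = [0.7, 0.2, 0.05, 0.05]
test = [0.4, 0.3, 0.2, 0.1]
distortion = kl_divergence(ref, test)  # larger value = greater phone-level distortion
```

In a full system, one such divergence could be computed per phone and the resulting values pooled (e.g., averaged over an utterance) to form the assessment feature; note that KL divergence is asymmetric, so the choice of reference direction matters.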
