An End-to-End Approach to Automatic Speech Assessment for Cantonese-speaking People with Aphasia

Conventional automatic assessment of pathological speech usually follows two main steps: (1) extraction of pathology-specific features; (2) classification or regression on the extracted features. Given the great variety of speech and language disorders, feature design is never straightforward, yet it is crucial to assessment performance. This paper presents an end-to-end approach to automatic speech assessment for Cantonese-speaking People With Aphasia (PWA). The assessment is formulated as a binary classification task that discriminates PWA with high subjective assessment scores from those with low scores. A 2-layer Gated Recurrent Unit (GRU) model and a Convolutional Neural Network (CNN) model are applied to realize an end-to-end mapping from basic speech features to the classification outcome, so the pathology-specific features used for assessment are learned implicitly by the neural network. The Class Activation Mapping (CAM) method is used to visualize how the learned features contribute to the assessment result. Experimental results show that the end-to-end approach achieves performance comparable to the conventional two-step approach on the classification task, and that the CNN model learns impairment-related features similar to the hand-crafted ones. The results also indicate that the CNN model performs better than the 2-layer GRU model on this specific task.

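As a concrete illustration of the end-to-end pipeline described above, the sketch below shows a minimal CNN classifier over time-frequency speech features, with global average pooling before the output layer so that a class activation map can be computed from the output-layer weights in the standard CAM fashion. The PyTorch framework, the layer sizes, and the log filter-bank input shape are illustrative assumptions for this sketch, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    """Sketch of a small CNN over a time-frequency feature map (e.g. log
    filter-bank features), ending in global average pooling + a linear layer
    so that Class Activation Mapping (CAM) can reuse the linear weights."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(64, n_classes)   # weights reused for CAM

    def forward(self, x):
        fmap = self.features(x)                        # (B, 64, F', T')
        logits = self.fc(self.gap(fmap).flatten(1))    # (B, n_classes)
        return logits, fmap

def class_activation_map(fmap, fc_weight, class_idx):
    """Weight the last convolutional feature maps by the linear-layer weights
    of the target class and sum over channels, giving a time-frequency heat
    map of the regions that drive the classification decision."""
    # fmap: (B, C, F', T'); fc_weight: (n_classes, C)
    w = fc_weight[class_idx].view(1, -1, 1, 1)
    return (w * fmap).sum(dim=1)                       # (B, F', T')

# Hypothetical usage on a batch of 40-band log filter-bank features.
model = CNNClassifier()
x = torch.randn(4, 1, 40, 300)                         # (B, 1, freq, time)
logits, fmap = model(x)
cam = class_activation_map(fmap, model.fc.weight, class_idx=1)
```

Upsampling the resulting heat map back to the input resolution would show which spectro-temporal regions the learned features attend to, which is how the visual comparison with hand-crafted impairment features can be made.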