Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription

This study proposes a staged knowledge distillation method for building end-to-end (E2E) automatic speech recognition (ASR) and automatic speech attribute transcription (ASAT) systems for patients with dysarthria caused by either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS). Compared with traditional methods, the proposed approach makes more effective use of the limited available dysarthric speech. On the TORGO dataset, the dysarthric E2E-ASR and ASAT systems enhanced by the proposed method achieve a 38.28% relative phone error rate (PER) reduction and a 48.33% relative attribute detection error rate (DER) reduction over their respective baselines. These experiments suggest that the system offers potential as a rehabilitation tool and medical diagnostic aid.
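
The abstract summarizes the method only at a high level, but its core building block, soft-target knowledge distillation in the style of Hinton et al., is standard. The following Python/PyTorch sketch shows a conventional per-stage distillation objective under that assumption; the temperature T, the weight alpha, and the function name are illustrative choices, not details taken from the paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft-target distillation (Hinton et al., 2015): blend a KL term
    # between temperature-softened teacher and student distributions
    # with ordinary cross-entropy against the hard labels.
    # T and alpha are illustrative defaults, not values from this paper.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

In a staged setup, a teacher trained on abundant typical speech would plausibly supply the soft targets while the student is adapted on the limited dysarthric data stage by stage; the exact staging schedule used in the paper is not specified here.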
