Personalizing ASR for Dysarthric and Accented Speech with Limited Data

Automatic speech recognition (ASR) systems have improved dramatically over the last few years. However, they are most often trained on 'typical' speech, so underrepresented groups do not experience the same level of improvement. In this paper, we present and evaluate finetuning techniques to improve ASR for users with non-standard speech. We focus on two types of non-standard speech: speech from people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative word error rate (WER) improvement on these two groups, respectively, bringing the absolute WER for ALS speakers, on a test set of message bank phrases, down to 10% for mild dysarthria and 20% for more serious dysarthria. We show that 71% of the improvement comes from only 5 minutes of training data. Finetuning a particular subset of layers (with far fewer parameters) often gives better results than finetuning the entire model. This is the first step towards building state-of-the-art ASR models for dysarthric speech.
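
To make the reported metrics concrete, the following minimal sketch shows how WER and relative WER improvement are conventionally computed. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and whitespace tokenization. The function names, the example phrase, and the baseline and personalized WER values are hypothetical illustrations, not the paper's evaluation code or results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, personalized_wer: float) -> float:
    """Fraction of the baseline WER removed by the personalized model."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - personalized_wer) / baseline_wer


# Hypothetical example phrase and recognizer output (not from the paper).
example_wer = word_error_rate(
    "i would like a glass of water",
    "i would like glass of waiter",
)

# Hypothetical WER values chosen only to illustrate the formula.
example_improvement = relative_wer_improvement(baseline_wer=0.40, personalized_wer=0.20)
```

Under this definition, a relative improvement is a proportion of the baseline error, so it has to be read alongside the absolute WER to judge how usable the resulting system is.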
