Source Domain Data Selection for Improved Transfer Learning Targeting Dysarthric Speech Recognition

This paper presents an improved transfer learning framework for building robust personalised speech recognition models for speakers with dysarthria. As the transfer learning baseline, a state-of-the-art CNN-TDNN-F ASR acoustic model trained solely on source-domain data is adapted to the target domain via neural network weight adaptation, using the limited data available from the target dysarthric speakers. Results on the UASpeech corpus show that the linear weights of the neural layers play the most important role in improved modelling of dysarthric speech, yielding average relative recognition improvements of 11.6% and 7.6% over conventional speaker-dependent training and data combination, respectively. To further improve transferability towards the target domain, we propose an utterance-based selection of source-domain data based on the entropy of the posterior probabilities, which is shown to follow an approximately Gaussian distribution. Compared with speaker-based data selection via a dysarthria similarity measure, this allows a more accurate selection of potentially beneficial source-domain data, either to enlarge the target-domain training pool or to construct an intermediate domain for incremental transfer learning, resulting in a further absolute recognition improvement of nearly 2% over the transfer learning baseline for speakers with moderate to severe dysarthria.
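
The adaptation step keeps the source-trained acoustic model fixed apart from selected weight matrices, with the linear (affine) weights proving the most effective to update. The paper's models are Kaldi CNN-TDNN-F networks; the PyTorch-style sketch below is only an illustrative analogue of the freeze-everything-but-the-linear-weights idea, and the function name, optimiser settings and `source_model` are assumptions rather than the authors' recipe.

```python
import torch
import torch.nn as nn

def prepare_linear_weight_adaptation(model: nn.Module) -> list:
    """Freeze every parameter, then re-enable gradients only for the
    weights of linear (affine) layers, so that fine-tuning on the small
    amount of target-speaker data updates nothing else."""
    for param in model.parameters():
        param.requires_grad = False
    adapted = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.weight.requires_grad = True  # adapt linear weights only
            adapted.append(module.weight)
    return adapted

# Hypothetical usage with a pre-trained source-domain model:
# params = prepare_linear_weight_adaptation(source_model)
# optimiser = torch.optim.SGD(params, lr=1e-4)
# ...then fine-tune on the target dysarthric speaker's utterances...
```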
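
The proposed selection scores each source-domain utterance by the entropy of the acoustic model's posterior probabilities and exploits the observation that these scores are approximately Gaussian-distributed. A minimal sketch of such a criterion is given below, assuming mean pooling of frame-level entropies and a keep-closest-to-the-mean rule with a `keep_fraction` parameter; these details are illustrative assumptions, not the paper's exact selection rule.

```python
import numpy as np
from scipy.stats import norm

def utterance_entropy(posteriors: np.ndarray) -> float:
    """Mean frame-level entropy of an utterance.

    posteriors: (T, C) array of per-frame posterior probabilities
    over C acoustic classes for an utterance of T frames.
    """
    eps = 1e-12
    frame_entropy = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    return float(np.mean(frame_entropy))

def select_source_utterances(entropies, keep_fraction=0.5):
    """Fit a Gaussian to the utterance-level entropy scores of the
    source corpus and keep the fraction of utterances whose scores lie
    closest to the fitted mean (an assumed selection rule)."""
    entropies = np.asarray(entropies, dtype=float)
    mu, sigma = norm.fit(entropies)               # maximum-likelihood mean / std
    z = np.abs(entropies - mu) / max(sigma, 1e-12)
    n_keep = int(round(keep_fraction * len(entropies)))
    keep_idx = np.argsort(z)[:n_keep]             # indices of retained utterances
    return keep_idx, (mu, sigma)

# Hypothetical usage, given per-utterance posterior matrices from a
# decoding pass over the source-domain corpus:
# scores = [utterance_entropy(p) for p in source_posteriors]
# keep_idx, (mu, sigma) = select_source_utterances(scores, keep_fraction=0.5)
```

The retained utterances can then either be pooled with the target-speaker training data or used to build an intermediate domain for incremental transfer learning, as described above.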
