Multi-view Representation Learning via Canonical Correlation Analysis for Dysarthric Speech Recognition

Although automatic speech recognition (ASR) is in widespread commercial use by the general public, it still does not perform sufficiently well for people with speech disorders (e.g., dysarthria). Multimodal ASR, which draws on multiple sources of signals, has recently shown potential to improve the performance of dysarthric speech recognition. When multiple views (sources) of data (e.g., acoustic and articulatory) are available for training but only one view (e.g., acoustic) is available for testing, a better representation can be learned by analyzing the views jointly. Although multi-view representation learning has recently been applied to the recognition of typical speech, it has rarely been studied for dysarthric speech recognition. In this paper, we investigate the effectiveness of multi-view representation learning via canonical correlation analysis (CCA) for dysarthric speech recognition. A representation of the acoustic data is learned with CCA from the multi-view training data (acoustic and articulatory); the articulatory data were recorded simultaneously with the acoustic data using an electromagnetic articulograph. Experimental evaluation on a database collected from nine patients with dysarthria due to amyotrophic lateral sclerosis (Lou Gehrig's disease) demonstrated the effectiveness of CCA-based multi-view representation learning for deep neural network-based speech recognition systems.
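The following is a minimal sketch of the multi-view training / single-view testing setup described above, using scikit-learn's linear CCA. The feature dimensions, number of CCA components, and the concatenation of the projected features with the raw acoustics are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Training: paired views, one row per speech frame (values here are
# random placeholders standing in for real parallel recordings).
X_acoustic = rng.standard_normal((1000, 39))      # e.g., MFCCs + deltas
Y_articulatory = rng.standard_normal((1000, 24))  # e.g., EMA sensor trajectories

# Fit CCA to find acoustic/articulatory projections with maximally
# correlated outputs; only the acoustic-side projection is needed at test time.
cca = CCA(n_components=20)
cca.fit(X_acoustic, Y_articulatory)

# Testing: only the acoustic view is available. Project it with the
# learned acoustic-side transform and (as one plausible choice) append
# it to the raw features before feeding a DNN acoustic model.
X_test = rng.standard_normal((200, 39))
z = cca.transform(X_test)            # CCA-projected acoustic features
features = np.hstack([X_test, z])    # augmented input for the recognizer
```

Because the articulatory view is used only during fitting, this setup imposes no extra requirements on test-time data collection, which is what makes it attractive for dysarthric speakers.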
