Audiovisual Phonologic-Feature-Based Recognition of Dysarthric Speech

Automatic dictation software with reasonably high word recognition accuracy is now widely available to the general public. Many people with gross motor impairment, including some people with cerebral palsy and closed head injuries, have not enjoyed the benefit of these advances, because their general motor impairment includes a component of dysarthria: reduced speech intelligibility caused by neuromotor impairment. These motor impairments often also preclude normal use of a keyboard. Case studies have shown that some dysarthric users may therefore find it easier to dictate to a small-vocabulary automatic speech recognition system, with code words representing letters and formatting commands and with acoustic models carefully adapted to the speech of the individual user, than to type on a keyboard. Developing each such individualized speech recognition system remains extremely labor-intensive, however, because so little is understood about the general characteristics of dysarthric speech. We propose to study the general audio and visual characteristics of articulation errors in dysarthric speech, and to apply the results of this study to the development of speaker-independent large-vocabulary and small-vocabulary audio and audiovisual dysarthric speech recognition systems.
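To make the code-word spelling interface mentioned above concrete, the sketch below decodes a sequence of recognized code words into text. It is a minimal illustration only: the NATO-style code words, the formatting commands, and the "delete" action are hypothetical assumptions, not the vocabularies used in the cited case studies, and the recognizer front end is omitted entirely.

```python
# Minimal sketch of a code-word spelling interface: each spoken code word
# maps to a letter or a formatting action. The vocabulary here is an
# illustrative assumption (NATO-style alphabet), not the one used in the
# case studies cited in the abstract.

CODE_WORDS = {
    "alpha": "a", "bravo": "b", "charlie": "c", "delta": "d", "echo": "e",
    "foxtrot": "f", "golf": "g", "hotel": "h", "india": "i", "juliet": "j",
    "kilo": "k", "lima": "l", "mike": "m", "november": "n", "oscar": "o",
    "papa": "p", "quebec": "q", "romeo": "r", "sierra": "s", "tango": "t",
    "uniform": "u", "victor": "v", "whiskey": "w", "xray": "x",
    "yankee": "y", "zulu": "z",
}

# Hypothetical formatting commands mapped to the characters they insert.
COMMANDS = {
    "space": " ",
    "newline": "\n",
}


def decode(recognized_words):
    """Convert a sequence of recognized code words into text.

    "delete" (a hypothetical correction command) removes the last emitted
    character; unrecognized words are ignored, since a small-vocabulary
    recognizer would normally reject them before this stage.
    """
    output = []
    for word in recognized_words:
        word = word.lower()
        if word in CODE_WORDS:
            output.append(CODE_WORDS[word])
        elif word in COMMANDS:
            output.append(COMMANDS[word])
        elif word == "delete" and output:
            output.pop()
    return "".join(output)


if __name__ == "__main__":
    # Spells "hi mom".
    print(decode(["hotel", "india", "space", "mike", "oscar", "mike"]))
```

In a deployed system, the per-user adaptation described above would be applied to the acoustic models of the small-vocabulary recognizer that produces the recognized_words sequence; the decoding step itself stays trivial by design.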
