Speaker-Independent Silent Speech Recognition From Flesh-Point Articulatory Movements Using an LSTM Neural Network

Silent speech recognition (SSR) converts non-audio information, such as articulatory movements, into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models; the high degree of variability in articulatory patterns across speakers has been a barrier to developing effective speaker-independent approaches. Speaker-independent SSR, however, is critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on the tongue and lips, using articulatory normalization methods that reduce interspeaker variation. To minimize across-speaker physiological differences of the articulators, we propose Procrustes matching-based articulatory normalization, which removes locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression (fMLLR) and i-vectors. We adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as the articulatory model to capture long-range articulatory history. A silent speech dataset of flesh-point articulatory movements was collected with an electromagnetic articulograph from 12 healthy and 2 laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on both healthy and laryngectomized speakers. In addition, the BLSTM outperformed a standard deep neural network. The best performance was obtained by the BLSTM with all three normalization approaches combined.
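As a rough illustration of the Procrustes matching-based normalization described above, each speaker's flesh-point shape can be translated to a common origin, scaled to unit size, and rotated into a reference orientation. The sketch below is a minimal NumPy version under assumed conventions (2-D sensor coordinates, a fixed reference shape); it is not the authors' implementation.

```python
import numpy as np

def procrustes_normalize(frames, reference):
    """Align a speaker's 2-D flesh-point shape to a reference shape by
    removing locational (translation), scaling, and rotational differences.

    frames    : (N, 2) array of sensor positions for one speaker
    reference : (N, 2) array defining the common target shape
    """
    # 1. Translation: center both shapes at the origin.
    X = frames - frames.mean(axis=0)
    Y = reference - reference.mean(axis=0)

    # 2. Scale: normalize each centered shape to unit Frobenius norm.
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)

    # 3. Rotation: orthogonal Procrustes solution via SVD, minimizing ||XR - Y||.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt

    # Speaker's shape mapped into the reference coordinate system.
    return X @ R
```

Applying the resulting transform to every frame of a speaker's recording places all speakers in a shared articulatory space before the further normalization (fMLLR, i-vectors) and BLSTM modeling described in the abstract.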
