Silent speech recognition from articulatory movements using deep neural network

Laryngectomee patients lose their ability to produce speech sounds and face significant difficulties in daily communication, yet currently have only limited communication options. Silent speech interfaces (SSIs), which recognize speech from articulatory information (i.e., without using audio information), have the potential to assist the oral communication of persons with laryngectomy or other speech or voice disorders. One of the challenging problems in SSI development is accurately recognizing speech from articulatory data. The deep neural network (DNN)-hidden Markov model (HMM) approach has recently been used successfully in (acoustic) speech recognition, showing significant improvements over the long-standing Gaussian mixture model (GMM)-HMM approach. DNN-HMM, however, has rarely been used in silent speech recognition. This paper investigated the use of DNN-HMM in recognizing speech from articulatory movement data. The articulatory data in the MOCHA-TIMIT data set were used in the experiments. Results indicated that DNN-HMM outperformed GMM-HMM in silent speech recognition.
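In a hybrid DNN-HMM system of the kind the abstract describes, the DNN replaces the GMM as the emission model: it estimates posteriors over tied HMM states from a window of articulatory frames, and those posteriors are converted to scaled likelihoods for Viterbi decoding. The following is a minimal PyTorch sketch of that idea only; the 14-channel/11-frame input dimensions, the 120 tied states, the layer sizes, and the synthetic stand-in data are all illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a hybrid DNN-HMM frame classifier for articulatory
# (EMA) data. All dimensions and data below are illustrative assumptions.
import torch
import torch.nn as nn

N_CHANNELS = 14        # e.g., x/y coordinates of 7 EMA sensors (assumption)
CONTEXT = 11           # number of articulatory frames stacked per input
N_STATES = 120         # number of tied HMM states (assumption)

class FrameClassifier(nn.Module):
    """DNN replacing the GMM emission model: maps a window of
    articulatory frames to log-posteriors over HMM states."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_CHANNELS * CONTEXT, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, N_STATES), nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

model = FrameClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.NLLLoss()

# Synthetic stand-ins for windowed EMA features and per-frame state labels;
# in practice features come from a corpus such as MOCHA-TIMIT and labels
# from a forced alignment produced by a baseline GMM-HMM system.
features = torch.randn(1000, N_CHANNELS * CONTEXT)
state_labels = torch.randint(0, N_STATES, (1000,))

for epoch in range(5):
    log_post = model(features)          # frame-level state log-posteriors
    loss = loss_fn(log_post, state_labels)
    optimizer.zero_grad()
    loss.backward()
    loss_value = loss.item()
    optimizer.step()

# At decoding time the HMM needs scaled likelihoods p(x|s); by Bayes' rule
# (dropping p(x)) these are obtained by subtracting log state priors,
# estimated here from the training-label counts.
counts = torch.bincount(state_labels, minlength=N_STATES).clamp(min=1)
log_priors = torch.log(counts.float() / len(state_labels))
scaled_log_likelihoods = model(features) - log_priors  # fed to Viterbi decoding
```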
