Updating the silent speech challenge benchmark with deep learning

The 2010 Silent Speech Challenge benchmark is updated with new results obtained using a Deep Learning strategy, with the same input features and decoding strategy as in the original article. A Word Error Rate of 6.4% is obtained, compared to the previously published value of 17.4%. Additional results are presented comparing new auto-encoder-based features with the original features at reduced dimensionality, along with decoding scenarios using two different language models. The Silent Speech Challenge archive has been updated to contain both the original and the new auto-encoder features, in addition to the original raw data.
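To make the auto-encoder feature extraction concrete, the sketch below shows one plausible way to learn a low-dimensional per-frame representation of the ultrasound/lip images, with the bottleneck width playing the role of the reduced feature dimensionality mentioned above. The framework (PyTorch), the names `FrameAutoencoder` and `train`, and all layer sizes and hyperparameters are illustrative assumptions, not the configuration used in the paper; the bottleneck activations would stand in for the original features as input to the DNN-HMM recognizer.

```python
# Minimal sketch only: architecture, sizes, and training setup are assumptions,
# not the authors' configuration.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Compresses a flattened image frame into a low-dimensional code."""
    def __init__(self, n_inputs=1024, n_code=30):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 256), nn.ReLU(),
            nn.Linear(256, n_code),            # bottleneck = per-frame feature vector
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_code, 256), nn.ReLU(),
            nn.Linear(256, n_inputs),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def train(model, batches, epochs=10, lr=1e-3):
    """Minimizes reconstruction error; afterwards, encoder outputs replace the
    original features as inputs to the speech recognizer."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in batches:                      # x: (batch, n_inputs) float tensor
            recon, _ = model(x)
            loss = loss_fn(recon, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```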
