A silent speech system based on permanent magnet articulography and direct synthesis

Highlights

- This paper introduces a silent speech interface (SSI) with the potential to restore the power of speech to people who have completely lost their voices.
- Small, unobtrusive magnets are attached to the lips and tongue, and changes in the magnetic field are sensed as the 'speaker' mouths what they want to say.
- The sensor data are transformed into acoustic data by a speaker-dependent transformation learned from parallel acoustic and sensor recordings.
- The machine learning technique used here is a mixture of factor analysers.
- Results are presented for three speakers, demonstrating that the SSI is capable of producing 'speech' that is both intelligible and natural.

Abstract

In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented by a mixture of factor analysers, a generative model that allows us to model non-linear behaviour efficiently while performing dimensionality reduction at the same time. The learned transformation is then deployed during normal use of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling the process up so that it functions consistently for phonetically rich vocabularies.
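The conversion model described in the abstract is a mixture of factor analysers (MFA) trained on joint vectors of parallel PMA and acoustic frames. As an illustration only, the sketch below shows one standard way such a joint-space MFA can be applied at conversion time: each component's marginal covariance is Λ_k Λ_kᵀ + Ψ_k, and the acoustic frame is estimated as the responsibility-weighted sum of per-component conditional means given the PMA frame. All parameter names are hypothetical pre-trained quantities, not the authors' actual implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mfa_convert(x, weights, means, loadings, noise_vars, dx):
    """Estimate an acoustic frame y from a PMA frame x under a mixture of
    factor analysers trained on joint vectors z = [x; y].

    All arguments are hypothetical pre-trained parameters:
      weights    (K,)      mixture weights pi_k
      means      (K, D)    joint-space component means mu_k
      loadings   (K, D, q) factor loading matrices Lambda_k
      noise_vars (K, D)    diagonal noise variances Psi_k
      dx         int       dimensionality of the PMA part of z
    """
    K, D = means.shape
    log_resp = np.empty(K)
    cond_means = np.empty((K, D - dx))
    for k in range(K):
        # Marginal covariance of component k: Lambda_k Lambda_k^T + Psi_k
        cov = loadings[k] @ loadings[k].T + np.diag(noise_vars[k])
        cxx, cyx = cov[:dx, :dx], cov[dx:, :dx]
        mx, my = means[k, :dx], means[k, dx:]
        # Responsibility of component k given the observed PMA frame
        log_resp[k] = np.log(weights[k]) + multivariate_normal.logpdf(
            x, mean=mx, cov=cxx)
        # Gaussian conditioning: E[y | x, k] = mu_y + C_yx C_xx^{-1} (x - mu_x)
        cond_means[k] = my + cyx @ np.linalg.solve(cxx, x - mx)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return resp @ cond_means  # responsibility-weighted MMSE estimate
```

In this line of work, frame-by-frame conversion of this kind is typically followed by parameter-trajectory smoothing before vocoding; the sketch covers only the per-frame articulatory-to-acoustic mapping.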
