Direct conversion from facial myoelectric signals to speech using Deep Neural Networks

This paper presents our first results on using Deep Neural Networks for surface electromyographic (EMG) speech synthesis. The proposed approach enables a direct mapping from EMG signals, captured from the articulatory muscle movements, to the acoustic speech signal. Features are extracted from multiple EMG channels and fed into a feed-forward neural network that maps them to the target acoustic speech output. We show that generating speech output from the input EMG signal is feasible with this approach, and we compare the results to a prior mapping technique based on Gaussian mixture models. The comparison is conducted via objective Mel-Cepstral Distortion scores and subjective listening tests; on both evaluation criteria, the proposed Deep Neural Network approach yields substantial improvements.
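As a concrete illustration of the mapping described above, the sketch below shows a small feed-forward network trained to regress mel-cepstral frames from stacked multi-channel EMG feature frames, together with the standard Mel-Cepstral Distortion (MCD) measure used for objective evaluation. This is a minimal sketch, not the authors' implementation: the layer sizes, feature dimensions (channel count, context width, mel-cepstral order), and the choice of PyTorch are illustrative assumptions.

```python
# Minimal sketch of frame-level EMG-to-speech regression (assumptions marked).
import numpy as np
import torch
import torch.nn as nn

EMG_FEAT_DIM = 6 * 32   # assumed: 6 EMG channels x 32 features per frame
CONTEXT = 5             # assumed: +/- 5 frames of temporal context stacking
MCEP_DIM = 25           # assumed: order-24 mel-cepstrum plus energy

# Feed-forward network: stacked EMG feature frames in, one mel-cepstral frame out.
model = nn.Sequential(
    nn.Linear(EMG_FEAT_DIM * (2 * CONTEXT + 1), 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, MCEP_DIM),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(emg_batch, mcep_batch):
    """One gradient step on a batch of (stacked EMG features, target mel-cepstra)."""
    optimizer.zero_grad()
    loss = loss_fn(model(emg_batch), mcep_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

def mel_cepstral_distortion(mc_ref, mc_est):
    """Standard MCD in dB between time-aligned mel-cepstral sequences of shape
    (frames, MCEP_DIM), excluding the 0th (energy) coefficient."""
    diff = mc_ref[:, 1:] - mc_est[:, 1:]
    return np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```

Frame-level regression with temporal context stacking is a common setup for such direct feature-mapping approaches; the baseline compared against in the paper instead performs the mapping with Gaussian mixture models.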
