Improving Fundamental Frequency Generation in EMG-to-Speech Conversion Using a Quantization Approach

We present a novel approach to generating fundamental frequency (intonation and voicing) trajectories in an EMG-to-Speech conversion Silent Speech Interface, based on quantizing the target values of the EMG-to-F0 mapping and thus turning a regression problem into a recognition problem. We describe this method and evaluate it with regard to both the accuracy of the voicing information obtained and the plausibility of the intonation trajectories generated within voiced sections of the signal. To this end, we also introduce a new measure of overall F0 trajectory plausibility, the trajectory-label accuracy (TLAcc), and compare it with human evaluations. Our new F0 generation method significantly outperforms a baseline approach in terms of voicing accuracy, correlation of voiced sections, trajectory-label accuracy and, most importantly, human evaluations.
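The core quantization idea can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: continuous F0 targets are mapped to discrete class labels, with a dedicated class for unvoiced frames, so that a frame-level classifier can be trained in place of a regressor; the bin count, F0 range, and log-spaced binning are illustrative assumptions.

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=32, f0_min=50.0, f0_max=400.0):
    """Map continuous F0 values (Hz) to discrete class labels.

    Class 0 is reserved for unvoiced frames (f0 <= 0); voiced frames
    fall into one of `n_bins` log-spaced bins, turning the regression
    target into a classification target.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    labels = np.zeros(f0_hz.shape, dtype=int)  # 0 = unvoiced
    voiced = f0_hz > 0
    log_f0 = np.log(np.clip(f0_hz[voiced], f0_min, f0_max))
    edges = np.linspace(np.log(f0_min), np.log(f0_max), n_bins + 1)
    # np.digitize returns 1..n_bins for values inside the edge range
    labels[voiced] = np.clip(np.digitize(log_f0, edges), 1, n_bins)
    return labels

def dequantize_f0(labels, n_bins=32, f0_min=50.0, f0_max=400.0):
    """Invert quantization: map class labels back to bin-center F0 values."""
    edges = np.linspace(np.log(f0_min), np.log(f0_max), n_bins + 1)
    centers = np.exp((edges[:-1] + edges[1:]) / 2.0)
    labels = np.asarray(labels)
    f0 = np.zeros(labels.shape, dtype=float)
    voiced = labels > 0
    f0[voiced] = centers[labels[voiced] - 1]
    return f0
```

A classifier trained on such labels predicts voicing and pitch jointly (unvoiced is simply one more class), and a continuous trajectory is recovered by de-quantizing the predicted labels to bin centers.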
