Direct F0 control of an electrolarynx based on statistical excitation feature prediction and its evaluation through simulation

An electrolarynx is a device that artificially generates excitation sounds to enable laryngectomees to produce electrolaryngeal (EL) speech. Although proficient laryngectomees can produce quite intelligible EL speech, it sounds very unnatural due to the mechanical excitation produced by the device. To address this issue, we have proposed several EL speech enhancement methods using statistical voice conversion and showed that statistical prediction of excitation parameters, such as F0 patterns, was essential to significantly improve naturalness of EL speech. In these methods, the original EL speech is recorded with a microphone and the enhanced EL speech is presented from a loudspeaker in real time. This framework is effective for telecommunication but it is not suitable to face-to-face conversation because both the original EL speech and the enhanced EL speech are presented to listeners. In this paper, we propose direct F0 control of the electrolarynx based on statistical excitation prediction to develop an EL speech enhancement technique also effective for face-to-face conversation. F0 patterns of excitation signals produced by the electrolarynx are predicted in real time from the EL speech produced by the laryngectomee’s articulation of the excitation signals with previously predicted F0 values. A simulation experiment is conducted to evaluate the effectiveness of the proposed method. The experimental results demonstrate that the proposed method yields significant improvements in naturalness of EL speech while keeping its intelligibility high enough. Index Terms: laryngectomee, electrolarynx, electrolaryngeal speech, statistical excitation prediction, simulation evaluation

[1]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[2]  Kou Tanaka,et al.  An evaluation of excitation feature prediction in a hybrid approach to electrolaryngeal speech enhancement , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Tomoki Toda,et al.  Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation , 2006, INTERSPEECH.

[4]  Tomoki Toda,et al.  Alaryngeal Speech Enhancement Based on One-to-Many Eigenvoice Conversion , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[6]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Ian Vince McLoughlin,et al.  Reconstruction of Normal Sounding Speech for Laryngectomy Patients Through a Modified CELP Codec , 2010, IEEE Transactions on Biomedical Engineering.

[8]  Tomoki Toda,et al.  Implementation of Computationally Efficient Real-Time Voice Conversion , 2012, INTERSPEECH.

[9]  Kai Yu,et al.  Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Hanjun Liu,et al.  Enhancement of electrolarynx speech based on auditory masking , 2006, IEEE Transactions on Biomedical Engineering.

[11]  P. C. Pandey,et al.  Real-time enhancement of electrolaryngeal speech by spectral subtraction , 2012, 2012 National Conference on Communications (NCC).

[12]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[13]  Mikihiro Nakagiri,et al.  Statistical Voice Conversion Techniques for Body-Conducted Unvoiced Speech Enhancement , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Hideki Kawahara,et al.  Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT , 2001, MAVEBA.

[15]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[16]  Kou Tanaka,et al.  A hybrid approach to electrolaryngeal speech enhancement based on spectral subtraction and statistical voice conversion , 2013, INTERSPEECH.

[17]  Tomoki Toda,et al.  Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech , 2012, Speech Commun..

[18]  Héctor M. Pérez Meana,et al.  Enhancement and Restoration of Alaryngeal Speech Signals , 2006, 16th International Conference on Electronics, Communications and Computers (CONIELECOMP'06).