Real-time vibration control of an electrolarynx based on statistical F0 contour prediction

An electrolarynx is a speaking aid that artificially generates excitation sounds to help laryngectomees produce electrolaryngeal (EL) speech. Although EL speech is quite intelligible, its naturalness suffers significantly from the unnatural fundamental frequency (F0) patterns of the mechanical excitation sounds. To enable more natural-sounding EL speech, we have proposed a method that automatically controls the F0 patterns of the excitation sounds generated by the electrolarynx using statistical F0 prediction, which predicts F0 patterns from the produced EL speech in real time. In our previous work, we developed a prototype system by implementing the proposed real-time prediction method in an actual, physical electrolarynx. Using this prototype, we found that the improvement in the naturalness of EL speech it yields tends to be smaller than that yielded by batch-type prediction. In this paper, we examine the negative impact of real-time prediction latency on F0 prediction accuracy. To alleviate it, we also propose two methods: 1) modeling of segmented continuous F0 (CF0) patterns and 2) prediction of forthcoming F0 values. The experimental results demonstrate that 1) the conventional real-time prediction method needs a large delay to predict CF0 patterns and 2) the proposed methods have a positive impact on real-time prediction.
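
To illustrate why CF0 patterns and latency are linked, the following is a minimal sketch and not the implementation used in this work. It assumes that unvoiced frames are marked with F0 = 0 and that a CF0 contour is obtained by linear interpolation across unvoiced frames; both conventions, and the function names `continuous_f0` and `max_lookahead_frames`, are assumptions made for illustration. Under these assumptions, filling an unvoiced frame requires the next voiced frame, so a streaming system must wait at least as long as the longest unvoiced gap.

```python
import numpy as np


def continuous_f0(f0):
    """Fill unvoiced frames (f0 == 0) by linear interpolation.

    Assumption: 0 marks an unvoiced frame. Frames before the first or
    after the last voiced frame take the nearest voiced value.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0.copy()  # nothing to interpolate from
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], f0[voiced])


def max_lookahead_frames(f0):
    """Largest number of future frames needed to interpolate any unvoiced frame.

    For each unvoiced frame this is the distance to the next voiced frame;
    trailing unvoiced frames with no later voiced frame are ignored.
    """
    f0 = np.asarray(f0, dtype=float)
    worst = 0
    next_voiced = None
    for t in range(len(f0) - 1, -1, -1):  # scan from the end of the contour
        if f0[t] > 0:
            next_voiced = t
        elif next_voiced is not None:
            worst = max(worst, next_voiced - t)
    return worst


if __name__ == "__main__":
    # Hypothetical frame-level F0 values in Hz; 0 marks an unvoiced frame.
    f0 = [100, 0, 0, 0, 140, 0, 120]
    print(continuous_f0(f0))
    print(max_lookahead_frames(f0))
```

In this toy contour the longest unvoiced gap spans three frames, so frame-by-frame interpolation would need three frames of lookahead. This is the kind of delay that modeling shorter CF0 segments or predicting forthcoming F0 values aims to reduce.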
