Physically Constrained Statistical F0 Prediction for Electrolaryngeal Speech Enhancement

Electrolaryngeal (EL) speech, produced by a laryngectomee using an electrolarynx that mechanically generates artificial excitation sounds, suffers severely from unnatural fundamental frequency (F0) patterns because the excitation sounds are monotonic. To address this issue, we previously proposed EL speech enhancement systems that use statistical F0 pattern prediction based on a Gaussian mixture model (GMM), which makes it possible to predict the underlying F0 pattern of EL speech from its spectral feature sequence. Our previous work also showed that the naturalness of the predicted F0 pattern can be improved by incorporating a physically based generative model of F0 patterns into the GMM-based statistical F0 prediction system within a product-of-experts framework. However, one drawback of this method is that it requires an iterative procedure to obtain a predicted F0 pattern, which makes a real-time system difficult to realize. In this paper, we propose another approach to physically based statistical F0 pattern prediction that uses an HMM-GMM framework. This approach is noteworthy in that it can generate an F0 pattern that is both statistically likely and physically natural without iterative procedures. Experimental results demonstrated that the proposed method generated F0 patterns more similar to those of normal speech than the conventional GMM-based method did.
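To make the GMM-based prediction step mentioned above more concrete, the following is a minimal sketch of frame-wise conditional-mean mapping under a joint GMM over a spectral feature vector x and a target value y (for example, log F0). It is not the authors' implementation: the function names, the toy parameters, and the choice of a simple frame-by-frame minimum-mean-square-error estimate are all assumptions made for illustration, and the physically based F0 model and the HMM-GMM framework of the proposed method are not represented here.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def gmm_conditional_mean(x, weights, means, covs, dx):
    """Return E[y | x] under a joint GMM over z = [x; y].

    weights: mixture weights, one per component
    means:   joint mean vectors, each of length dx + dy
    covs:    joint covariance matrices, each (dx + dy) x (dx + dy)
    dx:      dimensionality of the source part x (hypothetical name)
    """
    log_resp = []
    cond_means = []
    for w, mu, cov in zip(weights, means, covs):
        mu_x, mu_y = mu[:dx], mu[dx:]
        cov_xx = cov[:dx, :dx]
        cov_yx = cov[dx:, :dx]
        # Unnormalized log posterior of this component given x.
        log_resp.append(np.log(w) + gaussian_logpdf(x, mu_x, cov_xx))
        # Component-wise conditional mean of y given x.
        cond_means.append(mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x))
    log_resp = np.array(log_resp)
    resp = np.exp(log_resp - log_resp.max())  # stabilized softmax
    resp /= resp.sum()
    return resp @ np.array(cond_means)


# Toy usage with made-up parameters: 1-D source feature, 1-D target.
toy_weights = np.array([0.5, 0.5])
toy_means = [np.array([-1.0, 100.0]), np.array([1.0, 200.0])]
toy_covs = [np.eye(2), np.eye(2)]
toy_prediction = gmm_conditional_mean(
    np.array([0.0]), toy_weights, toy_means, toy_covs, dx=1
)
```

In practice the mixture parameters would be estimated from parallel training data rather than set by hand. Applying the estimate independently to each frame ignores temporal continuity, so a sketch like this would be expected to yield less smooth contours than trajectory-level or model-constrained methods.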
