Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech

An electrolarynx (EL) is a medical device that generates sound source signals to give laryngectomees a voice. In this article we focus on two problems of speech produced with an EL (EL speech). First, EL speech sounds extremely unnatural. Second, the EL generates high-energy sound source signals that often annoy people nearby. To address these two problems, we propose three speaking-aid systems, each of which enhances a different type of EL speech signal: EL speech, EL speech using an air-pressure sensor (EL-air speech), and silent EL speech. The air-pressure sensor enables a laryngectomee to manipulate the F0 contours of EL speech using exhaled air that flows from the tracheostoma. Silent EL speech is produced with a new sound source unit that generates signals with extremely low energy. Our speaking-aid systems address the poor quality of EL speech using voice conversion (VC), which transforms acoustic features so that the speech sounds as if it were uttered by another person. Our systems estimate spectral parameters, F0, and aperiodic components independently. Experimental evaluations demonstrate that the use of an air-pressure sensor dramatically improves F0 estimation accuracy. Moreover, the converted speech signals are preferred to the source EL speech.
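To make the idea of GMM-based conversion concrete, the following is a minimal sketch of a frame-wise mapping under a joint-density GMM, reduced to one-dimensional source and target features. It uses the simple minimum mean-square error (conditional expectation) mapping, which is an assumption made for illustration and is not necessarily the estimation procedure used in the proposed systems. The function names and all parameter names (`weights`, `mu_x`, `mu_y`, `var_xx`, `cov_yx`) are hypothetical, and the model parameters are assumed to have been trained beforehand on time-aligned source and target features.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_frame(x, weights, mu_x, mu_y, var_xx, cov_yx):
    """Map one source feature x to a target feature under a joint GMM.

    Each mixture component m is described by (hypothetical names):
      weights[m] : mixture weight
      mu_x[m]    : mean of the source feature
      mu_y[m]    : mean of the target feature
      var_xx[m]  : variance of the source feature
      cov_yx[m]  : covariance between target and source features
    """
    # Posterior probability of each component given the source feature.
    likelihoods = [
        w * gaussian_pdf(x, mx, vxx)
        for w, mx, vxx in zip(weights, mu_x, var_xx)
    ]
    total = sum(likelihoods)
    posteriors = [lik / total for lik in likelihoods]

    # Posterior-weighted sum of the per-component conditional means E[y | x, m].
    y_hat = 0.0
    for p, mx, my, vxx, cyx in zip(posteriors, mu_x, mu_y, var_xx, cov_yx):
        y_hat += p * (my + (cyx / vxx) * (x - mx))
    return y_hat
```

Under this sketch, separate models of the same form could be applied to each feature stream (spectral parameters, F0, aperiodic components), with multivariate Gaussians replacing the scalar ones for real feature vectors.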
