A preliminary demonstration of exemplar-based voice conversion for articulation disorders using an individuality-preserving dictionary

We present a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The articulatory movements of such speakers are limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes communication difficult. In this paper, exemplar-based spectral conversion using nonnegative matrix factorization (NMF) is applied to speech with an articulation disorder. To preserve the speaker's individuality, we use an individuality-preserving dictionary constructed from the source speaker's vowels and the target speaker's consonants. Using this dictionary, we can generate natural, clear speech that preserves the source speaker's vocal individuality. Experimental results indicate that NMF-based VC considerably outperforms conventional GMM-based VC.
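The core idea of exemplar-based NMF conversion can be sketched as follows: the source spectrogram is decomposed as nonnegative activations of a source exemplar dictionary, and the same activations are then applied to a parallel target dictionary. The sketch below is a minimal illustration, not the authors' implementation; function names, the Euclidean multiplicative-update rule, and all dimensions are illustrative assumptions. The individuality-preserving dictionary corresponds to building the target dictionary with vowel exemplars copied from the source speaker and consonant exemplars taken from the target speaker.

```python
import numpy as np

def nmf_activations(X, D, n_iter=200, eps=1e-10):
    """Estimate a nonnegative activation matrix H so that X ~= D @ H,
    holding the exemplar dictionary D fixed.

    Uses multiplicative updates for the Euclidean (squared-error)
    objective; each update keeps H nonnegative by construction.
    """
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], X.shape[1]))  # exemplars x frames
    for _ in range(n_iter):
        H *= (D.T @ X) / (D.T @ D @ H + eps)
    return H

def exemplar_vc(X_src, D_src, D_tgt, n_iter=200):
    """Exemplar-based conversion sketch.

    D_src and D_tgt hold column-aligned (parallel) exemplars: column k
    of each dictionary describes the same phonetic event in the two
    voices. For an individuality-preserving dictionary, the vowel
    columns of D_tgt would be copied from the source speaker and only
    the consonant columns taken from the target speaker.
    """
    H = nmf_activations(X_src, D_src, n_iter)  # decompose source frames
    return D_tgt @ H                           # resynthesize with target exemplars
```

In practice the decomposition is applied to spectral features frame by frame, and a sparsity penalty on H is often added so that each frame is explained by only a few exemplars.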