Many-to-one voice conversion using exemplar-based sparse representation

Voice conversion (VC) is being widely researched in the field of speech processing because of increased interest in using such processing in applications such as personalized Text-to-Speech systems. We present in this paper a many-to-one VC method using exemplar-based sparse representation, which is different from conventional statistical VC. In our previous exemplar-based VC method, input speech was represented by the source dictionary and its sparse coefficients. The source and the target dictionaries are fully coupled and the converted voice is constructed from the source coefficients and the target dictionary. This method requires parallel exemplars (which consist of the source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In this paper, we propose a many-to-one VC method in an exemplar-based framework which does not need training data of the source speaker. Some statistical approaches for many-to-one VC have been proposed; however, in the framework of exemplar-based VC, such a method has never been proposed. The effectiveness of our many-to-one VC has been confirmed by comparing its effectiveness with that of a conventional one-to-one NMF-based method and one-to-one GMM-based method.

[1]  Tetsuya Takiguchi,et al.  Multimodal exemplar-based voice conversion using lip features in noisy environments , 2014, INTERSPEECH.

[2]  Tomoki Toda,et al.  Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech , 2012, Speech Commun..

[3]  Shigeru Katagiri,et al.  ATR Japanese speech database as a tool of speech recognition and synthesis , 1990, Speech Commun..

[4]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[5]  Mikkel N. Schmidt,et al.  Single-channel speech separation using sparse non-negative matrix factorization , 2006, INTERSPEECH.

[6]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Chung-Hsien Wu,et al.  Map-based adaptation for speech conversion using adaptation data selection and non-parallel training , 2006, INTERSPEECH.

[8]  Tetsuya Takiguchi,et al.  Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Tetsuya Takiguchi,et al.  Exemplar-based voice conversion in noisy environment , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[10]  Yamato Ohtani,et al.  GMM-based bandwidth extension using sub-band basis spectrum model , 2014, INTERSPEECH.

[11]  Tuomas Virtanen,et al.  Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Athanasios Mouchtaris,et al.  Nonparallel training for voice conversion based on a parameter adaptation approach , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Methods for objective and subjective assessment of quality Subjective quality evaluation of telephone services based on spoken dialogue systems , 2004 .

[14]  Eric Moulines,et al.  Voice transformation using PSOLA technique , 1991, Speech Commun..

[15]  Tetsuya Takiguchi,et al.  Exemplar-Based Voice Conversion Using Sparse Representation in Noisy Environments , 2013, IEICE Trans. Fundam. Electron. Commun. Comput. Sci..

[16]  Haizhou Li,et al.  Exemplar-Based Sparse Representation With Residual Compensation for Voice Conversion , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[17]  Yoshihiko Nankaku,et al.  A mel-cepstral analysis technique restoring high frequency components from low-sampling-rate speech , 2014, INTERSPEECH.

[18]  Tomoki Toda,et al.  Eigenvoice conversion based on Gaussian mixture model , 2006, INTERSPEECH.

[19]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[20]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[21]  Tetsuya Takiguchi,et al.  A preliminary demonstration of exemplar-based voice conversion for articulation disorders using an individuality-preserving dictionary , 2014, EURASIP J. Audio Speech Music. Process..

[22]  Moncef Gabbouj,et al.  Voice Conversion Using Partial Least Squares Regression , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Tetsuya Takiguchi,et al.  Noise-Robust Voice Conversion Based on Sparse Spectral Mapping Using Non-negative Matrix Factorization , 2014, IEICE Trans. Inf. Syst..

[24]  Hideki Kawahara,et al.  STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds , 2006 .

[25]  Tuomas Virtanen,et al.  Exemplar-Based Sparse Representations for Noise Robust Automatic Speech Recognition , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[26]  Tomoki Toda,et al.  Esophageal Speech Enhancement Based on Statistical Voice Conversion with Gaussian Mixture Models , 2010, IEICE Trans. Inf. Syst..

[27]  Keikichi Hirose,et al.  One-to-Many Voice Conversion Based on Tensor Representation of Speaker Space , 2011, INTERSPEECH.