Many-to-many voice conversion based on multiple non-negative matrix factorization

This paper presents an exemplar-based Voice Conversion (VC) method using Non-negative Matrix Factorization (NMF), in contrast to conventional statistical VC. Compared with Gaussian Mixture Model (GMM)-based VC, NMF-based VC offers better noise robustness and more natural converted speech. However, because NMF-based VC relies on parallel training data from the source and target speakers, it cannot convert the voices of arbitrary speakers. We therefore propose a many-to-many VC method based on Multiple Non-negative Matrix Factorization (Multi-NMF). With Multi-NMF, an arbitrary speaker’s voice is converted to another arbitrary speaker’s voice without any training data from the input or output speaker. We expect the method to be flexible, since it can be adapted to voice quality control or noise-robust VC.

Index Terms: voice conversion, speech synthesis, many-to-many, exemplar-based, NMF
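
To make the parallel-data constraint concrete, the following is a minimal sketch of conventional exemplar-based NMF conversion, not of the proposed Multi-NMF method. It assumes a source dictionary and a target dictionary whose columns are frame-aligned exemplars, estimates non-negative activations against the fixed source dictionary with the standard multiplicative update for the generalized Kullback-Leibler divergence, and reuses those activations with the target dictionary. The names `estimate_activations`, `convert_spectrogram`, `W_src`, `W_tgt`, and the `sparsity` weight are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_activations(V, W, n_iter=200, sparsity=0.0, eps=1e-12, seed=0):
    """Estimate non-negative H such that V ~= W @ H, with the dictionary W held fixed.

    V: (n_bins, n_frames) non-negative magnitude spectrogram.
    W: (n_bins, n_exemplars) non-negative exemplar dictionary.
    Uses the multiplicative update for the generalized KL divergence,
    with an optional L1 sparsity weight on H (an illustrative assumption).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None]  # W^T 1, one value per exemplar
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (col_sums + sparsity + eps)
    return H


def convert_spectrogram(V_src, W_src, W_tgt, **kwargs):
    """Convert a source spectrogram using parallel (column-aligned) dictionaries.

    The activations are estimated on the source side and applied unchanged
    to the target dictionary, which is why the two dictionaries must be parallel.
    """
    H = estimate_activations(V_src, W_src, **kwargs)
    return W_tgt @ H
```

Because the same activation matrix is shared between both dictionaries, each column of `W_tgt` has to correspond to the same aligned frame as the matching column of `W_src`; this is the dependence on parallel data that the abstract identifies as the limitation.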
