Spectral Mapping Using Prior Re-Estimation of i-Vectors and System Fusion for Voice Conversion

In this paper, we propose a new voice conversion (VC) method using i-vectors, which provide a low-dimensional representation of speech utterances. We restrict the i-vector variability in the intermediate computation of the total variability ($\mathbf{T}$) matrix through a novel approach that uses a modified prior distribution of the intermediate i-vectors. This modification of $\mathbf{T}$ improves the conversion of speaker individuality. To further improve the conversion scores and to keep a better balance between similarity and quality, we apply band-wise spectrogram fusion between the spectrograms converted by a conventional joint density Gaussian mixture model (JDGMM) and by the i-vector based system. The fused spectrogram retains more spectral details and leverages the complementary merits of each subsystem. Extensive objective and subjective evaluations are conducted on the CMU ARCTIC database. The results show that the proposed technique produces a better trade-off between similarity and quality scores than other state-of-the-art baseline VC methods. Furthermore, it works better than JDGMM when VC training data is limited. The proposed VC performs moderately better (in both objective and subjective evaluations) than the baseline VC based on a mixture of factor analyzers. In addition, the proposed VC provides better-quality converted speech than maximum likelihood GMM VC with dynamic feature constraint.
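As a rough illustration of band-wise spectrogram fusion, the sketch below combines two converted spectrograms with one weight per frequency band. The abstract does not state the fusion rule, the band boundaries, or how the weights are chosen, so the linear weighting, the function name `bandwise_fuse`, and the parameters `band_edges` and `weights` are all assumptions made for illustration only.

```python
import numpy as np


def bandwise_fuse(spec_a, spec_b, band_edges, weights):
    """Fuse two spectrograms of shape (frames, bins) band by band.

    Hypothetical sketch: each band [band_edges[k], band_edges[k+1]) of the
    output is a convex combination w * spec_a + (1 - w) * spec_b. The actual
    fusion rule used in the paper is not given in the abstract.
    """
    spec_a = np.asarray(spec_a, dtype=float)
    spec_b = np.asarray(spec_b, dtype=float)
    if spec_a.shape != spec_b.shape:
        raise ValueError("spectrograms must have the same shape")
    if len(weights) != len(band_edges) - 1:
        raise ValueError("need exactly one weight per band")
    if band_edges[0] != 0 or band_edges[-1] != spec_a.shape[1]:
        raise ValueError("band_edges must cover all frequency bins")

    fused = np.empty_like(spec_a)
    # Walk over consecutive (low, high) bin indices together with their weight.
    for lo, hi, w in zip(band_edges[:-1], band_edges[1:], weights):
        fused[:, lo:hi] = w * spec_a[:, lo:hi] + (1.0 - w) * spec_b[:, lo:hi]
    return fused


# Toy usage: 2 frames, 4 frequency bins, two bands of 2 bins each.
jdgmm_spec = np.ones((2, 4))    # stands in for the JDGMM-converted spectrogram
ivec_spec = np.zeros((2, 4))    # stands in for the i-vector-converted spectrogram
fused_spec = bandwise_fuse(jdgmm_spec, ivec_spec, [0, 2, 4], [1.0, 0.25])
```

With these toy inputs the lower band is taken entirely from the first system and the upper band mostly from the second, which is one way complementary subsystems could be balanced per band.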
