Quality-enhanced voice morphing using maximum likelihood transformations

Voice morphing is a technique for modifying a source speaker's speech so that it sounds as if it had been spoken by a designated target speaker. The core process in a voice morphing system is the transformation of the source speaker's spectral envelope to match that of the target speaker, and linear transformations estimated from time-aligned parallel training data are commonly used to achieve this. However, naively applying envelope transformation together with the necessary pitch and duration modifications results in noticeable artifacts. This paper studies the linear transformation approach to voice morphing and investigates two specific issues. First, a general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches. Second, the main causes of artifacts are identified as glottal coupling, unnatural phase dispersion, and the high spectral variance of unvoiced sounds, and compensation techniques are developed to mitigate them. The resulting voice morphing system is evaluated using both subjective and objective measures. These tests show that the proposed approaches can effectively transform speaker identity while maintaining high quality. Furthermore, they do not require carefully prepared parallel training data.
