Robust processing techniques for voice conversion

Abstract

Differences in speaker characteristics, recording conditions, and signal processing algorithms affect output quality in voice conversion systems. This study focuses on formulating robust techniques for a codebook-mapping-based voice conversion algorithm. Three methods are used to improve voice conversion performance: confidence measures, pre-emphasis, and spectral equalization. Each method is analyzed and its implementation details are discussed. The first method employs confidence measures in the training stage to eliminate problematic pairs of source and target speech units that might result from possible misalignments, speaking-style differences, or pronunciation variations. Four confidence measures are developed based on the spectral distance, fundamental frequency (f0) distance, energy distance, and duration distance between the source and target speech units. The second method focuses on the importance of pre-emphasis in line spectral frequency (LSF) based vocal tract modeling and transformation. The last method, spectral equalization, is aimed at reducing the differences between the source and target long-term spectra when the source and target recording conditions are significantly different. The voice conversion algorithm that employs the proposed techniques is compared with the baseline voice conversion algorithm in objective tests as well as three subjective listening tests. First, similarity to the target voice is evaluated in a subjective listening test, and the proposed algorithm improves similarity to the target voice by 23.0%. In an ABX test, the proposed algorithm is preferred over the baseline algorithm in 76.4% of the comparisons. In the third test, the two algorithms are compared in terms of the subjective quality of the voice conversion output; the proposed algorithm improves subjective output quality by 46.8% in terms of mean opinion score (MOS).
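
The abstract describes the three techniques only at a high level and gives no implementation. The Python sketch below is an illustrative approximation of each step under stated assumptions: the function names, the dictionary fields used to describe a speech unit, the distance formulas, and the thresholds are all choices made here for illustration, not values or definitions taken from the paper.

```python
# Illustrative sketch only: names, fields, distance definitions and
# thresholds are assumptions, not taken from the paper.

import numpy as np


def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1], applied
    before LPC/LSF analysis. alpha = 0.97 is a common default, not a
    value reported in the paper."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])


def keep_pair(src, tgt, thresholds):
    """Hypothetical confidence-measure check on one aligned source/target
    speech-unit pair. Each unit is assumed to be a dict with 'lsf'
    (np.ndarray), 'f0', 'energy' and 'duration' fields; the pair is kept
    for codebook training only if all four distances stay below their
    thresholds."""
    spectral_d = np.mean(np.abs(src["lsf"] - tgt["lsf"]))
    f0_d = abs(np.log(src["f0"]) - np.log(tgt["f0"]))
    energy_d = abs(10.0 * np.log10(src["energy"] / tgt["energy"]))
    duration_d = abs(src["duration"] - tgt["duration"]) / tgt["duration"]
    return (spectral_d < thresholds["spectral"]
            and f0_d < thresholds["f0"]
            and energy_d < thresholds["energy"]
            and duration_d < thresholds["duration"])


def spectral_equalize(src_frames, src_longterm, tgt_longterm, eps=1e-10):
    """Hypothetical long-term spectral equalization: scale each frame's
    magnitude spectrum (rows of src_frames) by the ratio of the target
    to source long-term average spectra, to compensate for
    recording-condition mismatch."""
    return src_frames * (tgt_longterm + eps) / (src_longterm + eps)
```

In the setting the abstract describes, the pair filtering would run during training of the codebook mapping, pre-emphasis would precede the LSF analysis of each frame, and the equalization would be applied when source and target recording conditions differ; the thresholds and distance definitions above are placeholders to be tuned for a given corpus.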
