Japanese pitch conversion for voice morphing based on differential modeling

Abstract

In this paper, we convert the pitch contours predicted by a TTS system that models a source speaker to resemble the pitch contours of a target speaker. When the speaking styles of the speakers are very different, complex conversions such as adding or deleting pitch peaks may be required. Our method performs the conversions by modeling the direct pitch features and differential pitch features at the same time based on linguistic features. The differential pitch features are calculated from matched pairs of source and target pitch values. We show experimental results in which the target speaker’s characteristics are successfully modeled based on a very limited training corpus. The proposed pitch conversion method stretches the possibilities of TTS customization for various speaking styles.

Index Terms: pitch conversion, voice conversion, voice morphing, speech synthesis, differential modeling.

1. Introduction

Voice conversion changes the characteristics of the voice of an SPS (SPeaker Source) to those of an SPT (SPeaker Target) for various applications. One important application is to build customized text-to-speech (TTS) systems for different companies, so that a TTS system with each company’s favorite voice can be created quickly and inexpensively by modifying the speech corpus of some original speaker.

Spectra and prosody are the two major characteristics of voice. For spectral conversion, recent work such as [1, 2] has achieved significant improvements in the naturalness and similarity of the converted voices using only a limited amount of training data. However, not much research has been done on prosody conversion. Most spectral conversion research uses simple linear transformations for the prosody. It is true that the detailed prosody difference is sometimes difficult for listeners to distinguish [3], especially when the speakers are reading monotonously for TTS corpus recording. However, to generate TTS voices with a lively colloquial speaking style, reproduction of the detailed prosody characteristics is important.

Our objective is to reproduce the SPT’s speaking style of pitch contours based on limited training data. We assume that 100 sentences of training data is a reasonable amount to require for the SPT’s speech corpus. A speech corpus of that size can easily be recorded in a thirty-minute recording session. We do not assume the existence of a parallel corpus, which can be a difficult condition to satisfy. We focus on the pitch changes around the syllable level, because the important pitch changes in Japanese are mainly at the syllable level.

Figure 1 illustrates examples of pitch contour pairs that the proposed pitch conversion method can handle. They are (a) asymmetrical slope changes, (b) adding or deleting peaks, and (c) adding or deleting phrase-final rises.
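To make the contrast between a simple linear transformation and differential pitch features concrete, the following is a minimal sketch, not the method of this paper. It assumes that the linear baseline is a mean and variance normalization of log-F0, and that a differential pitch feature is the per-point log-F0 difference between matched source and target values. Both are assumptions, since this excerpt does not define them, and all function names and pitch values are hypothetical. The sketch also applies the differences directly rather than predicting them from linguistic features.

```python
import math
from statistics import mean, pstdev


def logf0_stats(f0_hz):
    """Return (mean, standard deviation) of log-F0 for pitch values in Hz."""
    logs = [math.log(f) for f in f0_hz]
    return mean(logs), pstdev(logs)


def linear_convert(src_f0, src_stats, tgt_stats):
    """Assumed baseline: shift and scale source log-F0 to the target statistics."""
    (src_mu, src_sigma), (tgt_mu, tgt_sigma) = src_stats, tgt_stats
    return [
        math.exp(tgt_mu + tgt_sigma * (math.log(f) - src_mu) / src_sigma)
        for f in src_f0
    ]


def differential_features(src_f0, tgt_f0):
    """Assumed form: per-point log-F0 difference over matched pitch values."""
    if len(src_f0) != len(tgt_f0):
        raise ValueError("source and target values must be matched one-to-one")
    return [math.log(t) - math.log(s) for s, t in zip(src_f0, tgt_f0)]


def apply_differential(src_f0, diffs):
    """Add log-domain differences back onto the source contour."""
    return [math.exp(math.log(s) + d) for s, d in zip(src_f0, diffs)]


def count_peaks(contour):
    """Count interior points that are strictly higher than both neighbors."""
    return sum(
        1
        for i in range(1, len(contour) - 1)
        if contour[i - 1] < contour[i] > contour[i + 1]
    )


# Hypothetical matched pitch values in Hz, for example one value per syllable.
source = [120.0, 150.0, 130.0, 125.0, 120.0, 110.0]  # one peak
target = [200.0, 240.0, 210.0, 230.0, 205.0, 190.0]  # two peaks

baseline = linear_convert(source, logf0_stats(source), logf0_stats(target))
diffs = differential_features(source, target)
restored = apply_differential(source, diffs)

print(count_peaks(baseline), count_peaks(target), count_peaks(restored))
```

Because the linear mapping is monotonic, the baseline contour keeps the single peak of the source and cannot add the second peak of the target. The per-point differences do carry that information, which is the kind of peak insertion listed as case (b) above.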

[1] Zhiwei Shuang, et al. Voice conversion by combining frequency warping with unit selection, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2] Michael Picheny, et al. The IBM expressive text-to-speech synthesis system for American English, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[3] Zhiwei Shuang, et al. Frequency warping based on mapping formant parameters, 2006, INTERSPEECH.

[4] Keiichi Tokuda, et al. Speech parameter generation algorithms for HMM-based speech synthesis, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[5] Daniel Erro, et al. Weighted frequency warping for voice conversion, 2007, INTERSPEECH.

[6] Taoufik En-Najjary, et al. A voice conversion method based on joint pitch and spectral envelope transformation, 2004, INTERSPEECH.

[7] Tomoki Toda, et al. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[8] Yuezhong Tang, et al. A Parametric Approach for Voice Conversion, 2006.

[9] Simon King, et al. Transforming F0 contours, 2003, INTERSPEECH.

[10] Ki-Young Lee, et al. Statistical Pitch Conversion Approaches Based on Korean Accentual Phrases, 2004, PRICAI.

[11] Yoshihiko Nankaku, et al. Voice conversion based on simultaneous modelling of spectrum and F0, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.