Dynamic prosody modification using zero frequency filtered signal

Prosody modification is the process of changing prosodic parameters such as pitch, duration and strength of excitation by a desired factor. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). Existing epoch based prosody modification techniques use epochs as pitch markers and achieve the required modification by interpolating the epoch interval plot. As an alternative, this work proposes prosody modification by resampling the ZFFS. The existing epoch based prosody modification method is also refined so that the prosodic parameters can be modified at the level of individual epochs, which provides more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used to obtain dynamic prosody modification from the existing PSOLA and epoch based prosody modification methods. The quality of the prosody modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated on a neutral to emotional conversion task. Subjective evaluations of the emotion conversion indicate that dynamic prosody modification is more effective than fixed prosody modification for this task. The dynamic prosody modified speech files synthesized using the proposed, epoch based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.
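
The idea of deriving modified epoch locations with a different factor at every epoch can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `modified_epochs`, its arguments, and the use of linear interpolation over a warped time axis are all assumptions made for illustration. It takes epoch locations in samples, a pitch factor and a duration factor per epoch interval (scalars are broadcast), and returns new epoch locations whose spacing follows the locally scaled pitch period.

```python
import numpy as np

def modified_epochs(epochs, pitch_factor, duration_factor):
    """Hypothetical sketch: derive new epoch locations from per-interval factors.

    epochs          : increasing epoch locations (samples)
    pitch_factor    : scalar or one value per epoch interval (>0); 2.0 halves the period
    duration_factor : scalar or one value per epoch interval (>0); 2.0 doubles the length
    """
    epochs = np.asarray(epochs, dtype=float)
    intervals = np.diff(epochs)            # original epoch intervals
    anchors = epochs[:-1]                  # each interval is anchored at its first epoch
    pitch = np.broadcast_to(np.asarray(pitch_factor, dtype=float), intervals.shape)
    dur = np.broadcast_to(np.asarray(duration_factor, dtype=float), intervals.shape)

    # Warped time axis: every original interval is stretched by its duration factor.
    warped = np.concatenate(([0.0], np.cumsum(intervals * dur)))

    new_epochs = [0.0]
    while True:
        y = new_epochs[-1]
        # Map the current position on the warped axis back to original time.
        x = np.interp(y, warped, epochs)
        # Local pitch period at that point, scaled by the local pitch factor.
        step = np.interp(x, anchors, intervals) / np.interp(x, anchors, pitch)
        if y + step > warped[-1]:
            break
        new_epochs.append(y + step)
    return np.array(new_epochs) + epochs[0]

# Example: raise pitch gradually from 1.0x to 2.0x over ten intervals, duration unchanged.
epochs = np.arange(0, 1001, 100)
new = modified_epochs(epochs, np.linspace(1.0, 2.0, 10), 1.0)
```

With constant factors this reduces to fixed prosody modification; passing arrays makes the modification vary from epoch to epoch. Synthesizing the waveform around the new epochs (for example by overlap-add or by resampling) is outside the scope of this sketch.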
