Neutral to anger speech conversion using non-uniform duration modification

This paper exploits non-uniform duration modification, together with other prosodic features, for converting neutral speech to anger speech. The non-uniform duration modification method modifies the durations of vowel and pause segments by different modification factors: vowel segments are modified by factors that depend on their identities, and pause segments by uniform factors. Consonant and transition segments are left unmodified. The modification factors are derived from an analysis of neutral and anger speech. For this purpose, a well-known Indian database, the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC), is used both for analyzing emotions and for synthesizing emotional speech from neutral speech. The prosodic features used for emotion conversion in this study are the pitch contour, the intensity contour, and the duration contour. Subjective listening tests show that the target emotion is perceived more effectively with non-uniform duration modification than with uniform duration modification.
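The segment-wise rule described above can be illustrated with a minimal sketch. It only computes target durations for already-labeled segments; it does not perform the signal-level time-scale modification itself. The factor values, the `Segment` type, and the function names are hypothetical placeholders chosen for illustration and are not the values or the implementation used in the paper, which derives its factors from neutral and anger speech in IITKGP-SESC.

```python
from dataclasses import dataclass

# Hypothetical factors for illustration only (assumed, not from the paper).
# A factor below 1.0 shortens a segment; 1.0 leaves it unchanged.
VOWEL_FACTORS = {"a": 0.8, "i": 0.9, "u": 0.85}  # one factor per vowel identity
PAUSE_FACTOR = 0.5                                # same factor for every pause


@dataclass
class Segment:
    kind: str        # "vowel", "pause", "consonant", or "transition"
    label: str       # vowel identity for vowels; free-form otherwise
    duration: float  # duration in seconds


def modification_factor(seg: Segment) -> float:
    """Return the duration modification factor for one segment."""
    if seg.kind == "vowel":
        # Vowels: factor depends on the vowel identity (1.0 if unknown).
        return VOWEL_FACTORS.get(seg.label, 1.0)
    if seg.kind == "pause":
        # Pauses: a single uniform factor.
        return PAUSE_FACTOR
    # Consonant and transition segments are not modified.
    return 1.0


def target_durations(segments: list[Segment]) -> list[float]:
    """Compute the modified duration of every segment in order."""
    return [seg.duration * modification_factor(seg) for seg in segments]


if __name__ == "__main__":
    utterance = [
        Segment("consonant", "k", 0.05),
        Segment("vowel", "a", 0.10),
        Segment("pause", "", 0.20),
        Segment("transition", "", 0.03),
    ]
    for seg, new_dur in zip(utterance, target_durations(utterance)):
        print(f"{seg.kind:<10} {seg.label:<2} {seg.duration:.3f} s -> {new_dur:.3f} s")
```

Uniform duration modification, the baseline in the listening test, would correspond to applying one common factor to every segment regardless of its kind.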
