Emotional Speech Synthesis using Subspace Constraints in Prosody

An efficient speech synthesis method that uses a subspace constraint on prosody is proposed. Conventional unit selection methods concatenate speech segments stored in a database, which requires an enormous number of waveforms to synthesize various emotional expressions for arbitrary texts. The proposed method employs principal component analysis to reduce the dimensionality of the prosodic components, which also allows us to generate new speech similar to the training samples. The subspace constraint ensures that the prosodic components of the synthesized speech, including F0, power, and speech length, preserve the correlations observed in the training samples of emotional speech. We assume that the combination of the number of syllables and the accent type determines the correlative dynamics of prosody, and we construct a separate subspace for each such combination. The subspace is then linearly related to emotions through multiple regression analysis, using emotion ratings obtained by subjective evaluation of the training samples. Experimental results demonstrated that only four dimensions were sufficient to represent the prosodic changes due to emotion, accounting for over 90% of the total variance. The synthesized emotions were successfully recognized by listeners of the synthesized speech, especially "anger", "surprise", "disgust", "sorrow", "boredom", "depression", and "joy".
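
The pipeline described above (per-class PCA subspace plus a linear map from emotion ratings to subspace coordinates) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes prosodic feature vectors (F0, power, duration parameters) and subjective emotion ratings are already available for one (syllable-count, accent-type) class, the regression direction (ratings to coordinates) is an assumption made for synthesis, and all variable names are hypothetical.

```python
# Minimal sketch of a subspace-constrained prosody model, assuming
# per-utterance prosodic feature vectors and emotion ratings exist.
# Placeholder data and names are illustrative, not from the paper.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# One (syllable-count, accent-type) class:
prosody_vectors = rng.normal(size=(100, 30))  # (n_samples, n_features)
emotion_ratings = rng.normal(size=(100, 7))   # (n_samples, n_emotions)

# 1. PCA reduces prosody to a low-dimensional subspace; the abstract
#    reports ~4 dimensions covering over 90% of the total variance.
pca = PCA(n_components=4)
subspace_coords = pca.fit_transform(prosody_vectors)

# 2. Multiple regression linearly relates emotion ratings to the
#    subspace coordinates (direction assumed here for synthesis).
regression = LinearRegression()
regression.fit(emotion_ratings, subspace_coords)

# 3. Synthesis: map a target emotion vector into the subspace, then
#    back to full prosodic parameters.  Staying inside the PCA
#    subspace is what preserves the trained correlations among
#    F0, power, and speech length.
target_emotion = np.zeros((1, 7))
target_emotion[0, 0] = 1.0                    # e.g. "anger" scale
coords = regression.predict(target_emotion)
synthesized_prosody = pca.inverse_transform(coords)
```

In practice, one such (PCA, regression) pair would be trained for every combination of syllable count and accent type, as the abstract states, and the synthesized prosodic parameters would then drive the waveform generation stage.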