Speech emotion recognition with combined short-term and long-term features

Utterance-based global statistics and frame-based temporal features have been widely used in speech emotion recognition systems, but these features cannot effectively describe all of the emotional information. In this research, segment-based features are extracted, and the best segment length for recognition is determined for each emotional state. Furthermore, a novel neural network model named Global control Elman is proposed to combine the utterance-based and segment-based features. Experiments show that the combined features reach a recognition rate of 66.0%, higher than that obtained with utterance-based features or segment-based features alone, an improvement of 5.9% and 1.7% respectively. The confusion between emotional states is also effectively reduced.
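
The difference between utterance-based and segment-based features can be illustrated with a minimal sketch. It assumes that frame-level acoustic features are already available as a NumPy array, and it uses mean and standard deviation as the statistics. The function names, the choice of statistics, the non-overlapping segmentation and the segment length of 20 frames are illustrative assumptions rather than the configuration used in this research. The Global control Elman network itself is not reproduced here.

```python
import numpy as np

def utterance_features(frames):
    """Global statistics over all frames of one utterance.

    frames: array of shape (n_frames, n_features).
    Returns a vector of length 2 * n_features (mean, then std).
    """
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def segment_features(frames, seg_len):
    """Statistics over consecutive, non-overlapping segments.

    seg_len is the segment length in frames (a hypothetical value here;
    the abstract states that a best length is chosen per emotional state).
    Trailing frames that do not fill a whole segment are dropped.
    Returns an array of shape (n_segments, 2 * n_features).
    """
    n_segments = len(frames) // seg_len
    if n_segments == 0:
        raise ValueError("utterance is shorter than one segment")
    trimmed = frames[: n_segments * seg_len]
    segs = trimmed.reshape(n_segments, seg_len, frames.shape[1])
    return np.concatenate([segs.mean(axis=1), segs.std(axis=1)], axis=1)

# Synthetic stand-in for real frame-level features: 103 frames, 4 features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(103, 4))

global_vec = utterance_features(frames)           # one vector per utterance
seg_mat = segment_features(frames, seg_len=20)    # one row per segment
```

In this sketch the utterance yields a single global vector and a sequence of segment vectors. A recurrent model could consume the segment sequence while also receiving the global vector, which is the kind of combination the abstract describes.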