Extracting emotion from voice

Understanding of emotion is greatly influenced by inputs such as voice, facial expressions, and body language. Yet few systems explore the broad field of the emotional human interface. No established analytical method, in either speech analysis or image processing, can reliably determine the intended or pure emotion. We concentrate on voice-based emotion analysis, motivated by the observation that humans can detect another person's emotional state from voice alone, without any semantic understanding. We developed a simplified human-based emotion model and a set of wavelet/cepstrum-based software tools for extracting emotion from the human voice. Our method operates on short-time energy segments that correspond to the words in a sentence. Power computed in short-time windows is included because it particularly emphasizes the difference between normal and angry speech. To capture a general pattern of human emotional understanding, 100 English and 50 Japanese voice samples were processed, with the aim of relating semantic and non-semantic emotional understanding. The voice samples are divided into angry, happy, normal, and "not defined" emotional-state groups according to general human judgment of the speech.
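
The following is a minimal sketch of the short-time energy and power computation described above, not the paper's actual implementation. The frame length, hop size, sampling rate, and the energy threshold used to mark word-level segments are illustrative assumptions.

```python
# Sketch: short-time energy per frame, rough word segmentation by an energy
# threshold, and average power per word segment. Parameters are assumptions.
import numpy as np


def short_time_energy(signal, frame_len=400, hop=160):
    """Energy of each overlapping frame (e.g. 25 ms frames, 10 ms hop at 16 kHz)."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop]
    return np.sum(frames.astype(np.float64) ** 2, axis=1)


def word_segments(energy, threshold_ratio=0.1):
    """Group consecutive frames whose energy exceeds a fraction of the peak
    energy into rough word-level segments, returned as (start, end) frame indices."""
    active = energy > threshold_ratio * energy.max()
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments


def segment_power(energy, segments, frame_len=400):
    """Mean per-sample power within each word segment (average frame energy / frame length)."""
    return [energy[s:e].mean() / frame_len for s, e in segments]


if __name__ == "__main__":
    # Synthetic example: two noise bursts ("words") separated by silence.
    rng = np.random.default_rng(0)
    sig = np.concatenate([rng.normal(0, 0.5, 8000),
                          np.zeros(4000),
                          rng.normal(0, 1.0, 8000)])
    energy = short_time_energy(sig)
    segs = word_segments(energy)
    print(segs, segment_power(energy, segs))
```

Because the louder segment yields a markedly higher per-segment power, a measure of this kind can help separate angry from normal speech, as noted above; the actual tools also draw on wavelet and cepstrum features not shown here.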