Recognition of Emotions using Energy Based Bimodal Information Fusion and Correlation

Multi-sensor information fusion is a rapidly developing research area that forms the backbone of numerous essential technologies, such as intelligent robotic control, sensor networks, and video and image processing. In this paper, we develop a novel technique to analyze and correlate human emotions expressed in voice tone and facial expression. Audio and video streams are captured to populate bimodal data sets that capture the emotions expressed in voice tone and facial expression, respectively. An energy-based mapping is applied to overcome the inherent heterogeneity of the recorded bimodal signals. The fusion process uses the sampled and mapped energy signals of both modalities' data streams and then recognizes the overall emotional component using a Support Vector Machine (SVM) classifier, achieving an accuracy of 93.06%.
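The pipeline described above (per-frame energy extraction from each modality, a mapping onto a common scale to handle heterogeneity, fusion of the two energy streams, and SVM classification) could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the frame length, the log-energy mapping, the resampling scheme, and the synthetic two-class data are all assumptions introduced here.

```python
import numpy as np
from sklearn.svm import SVC

def frame_energy(signal, frame_len=256):
    """Short-term energy per frame: sum of squared samples.
    frame_len is an illustrative choice, not the paper's value."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    return np.sum(frames ** 2, axis=1)

def energy_map(e, n_samples=32):
    """Map one modality's energy series onto a common scale and length.
    Log compression plus linear resampling stand in for the paper's
    energy-based mapping, whose exact form is not given in the abstract."""
    e = np.log(e + 1e-12)
    grid = np.linspace(0.0, 1.0, n_samples)
    return np.interp(grid, np.linspace(0.0, 1.0, len(e)), e)

def fuse(audio, video_motion):
    """Concatenate the mapped energy signals of both modalities
    into one feature vector for the classifier."""
    return np.concatenate([energy_map(frame_energy(audio)),
                           energy_map(video_motion)])

# Synthetic stand-in data: two "emotion" classes that differ in energy level.
rng = np.random.default_rng(0)
X, y = [], []
for label, scale in [(0, 0.5), (1, 2.0)]:
    for _ in range(20):
        audio = rng.normal(0.0, scale, 4096)          # fake audio waveform
        motion = np.abs(rng.normal(0.0, scale, 60))   # fake per-frame motion energy
        X.append(fuse(audio, motion))
        y.append(label)

clf = SVC(kernel="rbf").fit(np.array(X), y)
```

In this sketch, each fused sample is a fixed-length 64-dimensional vector (32 mapped energy values per modality), so heterogeneous frame rates and sample counts in the two streams no longer matter to the classifier.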
