Audiovisual Behavior Modeling by Combined Feature Spaces

Behavior modeling has recently attracted great interest, especially for public surveillance tasks. It is generally agreed that combining several input cues, such as audio and video, is beneficial. Yet synchronizing and fusing these information sources remains the main challenge. We therefore present results for a combined feature space, which allows the feature space to be optimized as a whole. Audio and video features are first derived as low-level descriptors. Synchronization and feature combination are then achieved by multivariate time-series analysis. Test runs on a database of aggressive, cheerful, intoxicated, nervous, neutral, and tired behavior in an airplane scenario show a significant improvement over each single modality.
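To make the idea of a combined feature space concrete, the following is a minimal sketch of one possible early-fusion scheme, not the implementation evaluated here. It assumes that audio and video low-level descriptors arrive as frame-wise arrays at different rates, that the video stream is linearly interpolated onto the audio time base, and that each resulting time series is summarized by four statistical functionals. The function names (`resample`, `functionals`, `fuse`), the frame rates (100 Hz and 25 Hz), and the choice of functionals are all illustrative assumptions.

```python
import numpy as np


def resample(series, src_rate, dst_rate):
    """Linearly interpolate a (frames, dims) series from src_rate to dst_rate (Hz)."""
    n_src = series.shape[0]
    n_dst = int(round(n_src / src_rate * dst_rate))
    t_src = np.arange(n_src) / src_rate
    t_dst = np.arange(n_dst) / dst_rate
    # np.interp works on 1-D data, so interpolate each descriptor separately.
    return np.column_stack(
        [np.interp(t_dst, t_src, series[:, d]) for d in range(series.shape[1])]
    )


def functionals(series):
    """Summarize each descriptor contour by mean, std, min, and max (assumed set)."""
    return np.concatenate(
        [series.mean(axis=0), series.std(axis=0), series.min(axis=0), series.max(axis=0)]
    )


def fuse(audio_lld, video_lld, audio_rate=100.0, video_rate=25.0):
    """Synchronize both streams and return one combined feature vector.

    The rates are hypothetical defaults, not values taken from the paper.
    """
    video_up = resample(video_lld, video_rate, audio_rate)
    n = min(len(audio_lld), len(video_up))  # trim to the common length
    combined = np.hstack([audio_lld[:n], video_up[:n]])  # joint multivariate series
    return functionals(combined)


# Usage with synthetic data: 2 s of audio (3 descriptors) and video (2 descriptors).
audio = np.zeros((200, 3))
video = np.ones((50, 2))
vector = fuse(audio, video)
```

Because both modalities end up in a single vector, any feature selection or classifier applied afterwards operates on the joint space rather than on each modality separately, which is the property the combined feature space is meant to provide.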
