Cascaded Fusion of Dynamic, Spatial, and Textural Feature Sets for Person-Independent Facial Emotion Recognition

Emotion recognition from facial expressions is a highly demanding task, especially in everyday-life scenarios. Several sources of artifacts have to be handled in order to reliably extract the intended emotional nuances of the face. The accurate and robust detection and orientation of faces, impeded by occlusions, inhomogeneous lighting, and fast movements, is only one difficulty. Another is the selection of features suited to the application at hand. In the literature, a vast body of visual features, grouped into dynamic, spatial, and textural families, has been proposed. Owing to their inherent structure, these features exhibit different strengths and weaknesses and thus capture complementary information, which makes them a promising vantage point for fusion architectures. To combine different feature sets and exploit their respective advantages, an adaptive multilevel fusion architecture is proposed. The cascaded approach integrates information at different levels and time scales, using artificial neural networks to adaptively weight the propagated intermediate results. The performance of the proposed architecture is analysed on the GEMEP-FERA corpus as well as on a novel dataset recorded in an unconstrained, spontaneous human-computer interaction scenario. The obtained performance is superior to that of single channels and basic fusion techniques.
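To make the cascaded principle concrete, the following is a minimal sketch in scikit-learn terms: one probabilistic base classifier per feature family (dynamic, spatial, textural) produces intermediate class posteriors, and a small neural network learns to adaptively weight and combine them. The function names (train_cascade, predict_cascade) and the choice of SVC and MLPClassifier are illustrative assumptions, not the authors' implementation; the integration over different time scales is omitted for brevity.

```python
# A minimal sketch of cascaded, adaptively weighted fusion, assuming
# scikit-learn and three pre-extracted feature sets. All names here are
# hypothetical and chosen for illustration only.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def train_cascade(feature_sets, y):
    """feature_sets: list of (n_samples, n_features_i) arrays, one per
    feature family (e.g. dynamic, spatial, textural); y: emotion labels."""
    # Level 1: one probabilistic base classifier per feature family.
    base = [SVC(probability=True).fit(X, y) for X in feature_sets]
    # Intermediate results: per-channel class posteriors, concatenated.
    probs = np.hstack([clf.predict_proba(X)
                       for clf, X in zip(base, feature_sets)])
    # Level 2: a small neural network learns adaptive weights for the
    # propagated intermediate estimates instead of fixed fusion rules.
    fusion = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)
    fusion.fit(probs, y)
    return base, fusion

def predict_cascade(base, fusion, feature_sets):
    probs = np.hstack([clf.predict_proba(X)
                       for clf, X in zip(base, feature_sets)])
    return fusion.predict(probs)
```

In this sketch the second-level network plays the role of the adaptive combiner: rather than averaging or voting over channels, it learns from the training data how much each channel's posterior should contribute per class, which is the advantage claimed over basic fusion techniques.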
