Fusion Mappings for Multimodal Affect Recognition

Affect recognition is an inherently multimodal task, which makes it an appealing setting for investigating classifier combination approaches in real-world scenarios. A variety of independent classifiers can be constructed from largely independent features, without having to rely on artificial feature views. In this paper, we study a wide range of fusion approaches based on a multitude of features extracted from audio, video, and physiological signals. We use the RECOLA data collection for this purpose and show how an ensemble of classifiers can outperform the best individual classifier.
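As a rough illustration of why combining modality-specific predictors can help, the following sketch compares a fixed fusion rule (the unweighted mean) with a trained linear fusion mapping. It is not the method evaluated in the paper: the data are synthetic rather than taken from RECOLA, the three "modalities" are hypothetical noisy views of a single continuous label, the noise levels are arbitrary, and mean squared error is assumed as the evaluation measure purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a continuous affect label (assumption: not RECOLA data).
n_samples = 2000
target = rng.normal(size=n_samples)

# Hypothetical per-modality regressors: each sees the target through
# independent noise; the noise levels are arbitrary illustrative values.
noise_std = {"audio": 0.6, "video": 0.7, "physio": 0.8}
predictions = np.column_stack(
    [target + rng.normal(scale=s, size=n_samples) for s in noise_std.values()]
)


def mse(pred, ref):
    """Mean squared error between a prediction and the reference."""
    return float(np.mean((pred - ref) ** 2))


# First half is used to fit the fusion mapping, second half is held out.
half = n_samples // 2
train, test = slice(0, half), slice(half, None)

# Fixed fusion: unweighted mean of the individual outputs.
mean_fused = predictions[test].mean(axis=1)

# Trained fusion: linear weights plus a bias term, fitted by least squares.
design_train = np.column_stack([predictions[train], np.ones(half)])
weights, *_ = np.linalg.lstsq(design_train, target[train], rcond=None)
design_test = np.column_stack([predictions[test], np.ones(n_samples - half)])
learned_fused = design_test @ weights

individual_mse = [
    mse(predictions[test, k], target[test]) for k in range(predictions.shape[1])
]
best_individual_mse = min(individual_mse)
mean_fused_mse = mse(mean_fused, target[test])
learned_fused_mse = mse(learned_fused, target[test])

print("individual:", dict(zip(noise_std, individual_mse)))
print("mean fusion:", mean_fused_mse)
print("learned fusion:", learned_fused_mse)
```

Because the simulated errors are independent across modalities, both fusion rules reduce the error below that of the best single predictor in this toy setting. With correlated errors, as is common for real classifiers, the gain would be smaller and is not guaranteed.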
