Deep Bimodal Regression for Apparent Personality Analysis

Apparent personality analysis from short video sequences is a challenging problem in computer vision and multimedia research. To capture rich information from both the visual and the audio modalities of videos, we propose the Deep Bimodal Regression (DBR) framework. In DBR, for the visual modality, we modify traditional convolutional neural networks to exploit important visual cues. For the audio modality, taking model efficiency into account, we extract audio representations and build a linear regressor. To combine the complementary information from the two modalities, we ensemble the predicted regression scores using both early fusion and late fusion. Finally, based on the proposed framework, we present our solution for the Apparent Personality Analysis competition track of the ChaLearn Looking at People challenge held in conjunction with ECCV 2016. Our DBR won first place in this challenge among 86 registered teams.
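
The fusion step can be illustrated with a minimal sketch. The Python snippet below assumes, hypothetically, that each modality-specific regressor outputs one score per Big Five trait in [0, 1] and that late fusion is a weighted average of the two score vectors; the function name `late_fusion`, the equal-weight default, and the example values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical late-fusion step: each modality-specific regressor predicts a
# score in [0, 1] for every Big Five trait, and the final prediction is a
# weighted average of the two modalities (equal weights are an assumption).

def late_fusion(visual_scores: np.ndarray, audio_scores: np.ndarray,
                visual_weight: float = 0.5) -> np.ndarray:
    """Weighted average of per-trait scores from the visual and audio models."""
    audio_weight = 1.0 - visual_weight
    return visual_weight * visual_scores + audio_weight * audio_scores

# Example: per-trait scores (openness, conscientiousness, extraversion,
# agreeableness, neuroticism) from each modality; values are made up.
visual = np.array([0.62, 0.55, 0.48, 0.70, 0.41])
audio = np.array([0.58, 0.60, 0.52, 0.66, 0.45])
print(late_fusion(visual, audio))
```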
