A Method for Opinion Classification in Video Combining Facial Expressions and Gestures

Most research on video-based opinion recognition combines data from three sources: video, audio, and text. These solutions therefore rely on complex, language-dependent models and, beyond that complexity, often attain low performance in practical applications. To overcome these drawbacks, this work presents an opinion classification method that uses only video as its data source: facial expression and body gesture information is extracted from online videos and combined to achieve higher classification rates. The proposed method applies feature encoding strategies to improve data representation and facilitate the classification task, predicting users' opinions with high accuracy and independently of the language spoken in the videos. Experiments were carried out on three public databases against three baselines. The results show that, even though it performs only visual analysis, the proposed method achieves accuracy and precision rates 16% higher than baselines that analyze visual, audio, and textual data. Moreover, the results show that the proposed method can identify emotions in videos whose language differs from the language used for training.
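To make the "feature encoding" step concrete: a common encoding in this line of work is the Fisher vector, which aggregates a video's local descriptors against a Gaussian mixture model fitted on training data. The sketch below is a minimal, hypothetical illustration of how facial-expression and gesture descriptors could be encoded and fused for opinion classification; it is not the paper's exact pipeline, and the names `fisher_vector`, `face_descs`, `gesture_descs`, and `labels` are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(descriptors, gmm):
    """Simplified Fisher vector: gradients w.r.t. the GMM means only."""
    X = np.atleast_2d(descriptors)          # (N, D) local descriptors
    N = X.shape[0]
    gamma = gmm.predict_proba(X)            # (N, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)       # diagonal covariances, (K, D)
    # Accumulate normalized deviations from each Gaussian's mean.
    diff = (X[:, None, :] - gmm.means_[None]) / sigma[None]   # (N, K, D)
    fv = (gamma[:, :, None] * diff).sum(axis=0)               # (K, D)
    fv /= N * np.sqrt(gmm.weights_)[:, None]
    fv = fv.ravel()                         # final dimension: K * D
    # Power and L2 normalization, standard practice for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Hypothetical inputs: face_descs[i] and gesture_descs[i] hold the local
# descriptors extracted from video i (the extractors themselves are assumed,
# not shown), and labels[i] is that video's opinion class.
gmm_face = GaussianMixture(64, covariance_type="diag").fit(np.vstack(face_descs))
gmm_gest = GaussianMixture(64, covariance_type="diag").fit(np.vstack(gesture_descs))

# Encode each modality separately, then fuse by concatenation.
features = np.array([
    np.concatenate([fisher_vector(f, gmm_face), fisher_vector(g, gmm_gest)])
    for f, g in zip(face_descs, gesture_descs)
])
clf = LinearSVC().fit(features, labels)
```

Concatenating the two per-modality encodings is one simple fusion choice; any classifier over the fused vector would work, and the linear SVM here is only a placeholder for whatever classifier the method actually uses.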
