Video-based emotion recognition in the wild

Abstract In-the-wild emotion recognition must cope with large variance in the input signals, multiple sources of noise that can distract the learners, and difficult conditions for annotation and ground-truth acquisition. In this chapter, we briefly survey the latest developments in multimodal approaches to video-based emotion recognition in the wild and describe our approach to the problem. For the visual modality, we propose summarizing functionals over complementary visual descriptors. For the audio modality, we propose a standard computational pipeline for paralinguistics. We combine audio and visual features with least-squares regression-based classifiers and weighted score-level fusion. We report state-of-the-art results on the EmotiW Challenge for "in-the-wild" facial-expression recognition. Our approach generalizes to related problems, ranking first in two further challenges: the ChaLearn-LAP First Impressions Challenge (ICPR 2016) and the ChaLearn-LAP Job Interview Candidate Screening Challenge (CVPR 2017).
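
The combination of least-squares regression-based classifiers with weighted score-level fusion can be sketched as follows. This is a minimal illustration on synthetic data, not the chapter's actual pipeline: the feature dimensions, regularization constant `lam`, and fusion weight `w` are all placeholder assumptions (in practice such weights are tuned on validation data).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 60, 7  # e.g., seven basic emotion categories

# Synthetic stand-ins for modality-specific features (the chapter's
# actual visual/audio descriptors differ).
X_vis = rng.standard_normal((n_samples, 20))
X_aud = rng.standard_normal((n_samples, 12))
y = rng.integers(0, n_classes, n_samples)
Y = np.eye(n_classes)[y]  # one-hot class targets

def ls_classifier(X, Y, lam=1.0):
    """Closed-form regularized least-squares regression to one-hot targets."""
    d = X.shape[1]
    # Solve (X'X + lam*I) W = X'Y for the weight matrix W.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

W_vis = ls_classifier(X_vis, Y)
W_aud = ls_classifier(X_aud, Y)

# Weighted score-level fusion: convex combination of per-class scores.
w = 0.6  # illustrative visual-modality weight (assumed)
scores = w * (X_vis @ W_vis) + (1 - w) * (X_aud @ W_aud)
pred = scores.argmax(axis=1)  # fused class decision per sample
```

Each modality is thus classified independently, and only the per-class score vectors are combined, which keeps the fusion step cheap and lets the weight reflect each modality's reliability.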