Learning the fusion of audio and video aggression assessment by meta-information from human annotations

The focus of this paper is predicting aggression with a multimodal system, given multiple unimodal features. The mechanism underlying multimodal sensor fusion is complex and not completely understood. We try to understand the fusion process and make it more transparent. As a case study we use a database of audio-visual recordings of aggressive behavior in trains. We have collected multimodal and unimodal assessments from human annotators, who rated aggression on a 3-point scale. There is no trivial fusion step that predicts the multimodal labels from the unimodal labels. We therefore propose an intermediate step to uncover the structure of the fusion process. We call the resulting descriptors meta-features, and we identify a set of five that influence the fusion process. Using a propositional rule-based learner, we show that the meta-features have a strong positive impact on predicting the multimodal label in the complex situations where the audio, video and multimodal labels do not reinforce each other. We conclude with an experiment that demonstrates the added value of this approach on the whole data set.
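To make the fusion step concrete, the following is a minimal sketch of the kind of learning task described above: predicting a multimodal aggression label from the two unimodal labels plus a few meta-features, using a shallow decision tree as a stand-in for the propositional rule-based learner used in the paper. The data, the meta-feature names (context_mismatch, audio_dominant) and the label-generation rule are hypothetical placeholders, not the paper's actual annotations.

```python
# Minimal sketch (not the paper's implementation): fusing unimodal aggression
# labels with meta-features to predict the multimodal label. A shallow decision
# tree stands in for a propositional rule-based learner so that the learned
# fusion rules remain human-readable. All data below is synthetic.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500

# Unimodal labels on a 3-point scale (1 = neutral, 3 = aggressive).
audio = rng.integers(1, 4, size=n)
video = rng.integers(1, 4, size=n)

# Hypothetical binary meta-features describing the annotation context.
context_mismatch = rng.integers(0, 2, size=n)  # do audio and video suggest different contexts?
audio_dominant = rng.integers(0, 2, size=n)    # is the audio channel the most informative one?

# Synthetic multimodal label: mostly the maximum of the unimodal labels,
# overridden when the meta-features indicate a non-reinforcing situation.
multimodal = np.maximum(audio, video)
multimodal = np.where((context_mismatch == 1) & (audio_dominant == 1), audio, multimodal)

X = np.column_stack([audio, video, context_mismatch, audio_dominant])
feature_names = ["audio_label", "video_label", "context_mismatch", "audio_dominant"]

# Keep the tree shallow so its branches read like propositional fusion rules.
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, multimodal)
print(export_text(clf, feature_names=feature_names))
```

In the paper itself, the meta-features come from human annotations rather than synthetic draws, and the rule learner is evaluated specifically on the cases where audio, video and multimodal labels disagree.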