Addressing Multimodality in Overt Aggression Detection

Automatic detection of aggressive situations has high societal and scientific relevance. It has been argued that using data from multiple modalities, for example video and sound, rather than from a single modality is bound to increase detection accuracy. We approach the problem of multimodal aggression detection from the viewpoint of a human observer and try to reproduce their predictions automatically. Typically, a single ground truth covering all available modalities is used when training recognizers. We explore the benefits of adding an extra level of annotations, namely audio-only and video-only labels. We analyze these annotations and compare them to the multimodal case in order to gain more insight into how humans reason with multimodal data. We train classifiers and compare the results obtained when using unimodal versus multimodal labels as ground truth. For both the audio and the video recognizer, performance increases when the unimodal labels are used. A minimal sketch of this label-set comparison is given below.
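
The sketch below illustrates the comparison described in the abstract: training the same audio recognizer once with multimodal ground-truth labels and once with audio-only labels, then scoring each against the label set it was trained on. The features, the SVM classifier, the random placeholder data, and the choice to evaluate each model against its own annotation track are assumptions for illustration only; the paper's actual features, classifiers, and data are not specified here.

```python
# Hedged sketch: compare an audio recognizer trained/evaluated with multimodal
# labels against one trained/evaluated with audio-only labels.
# All data below is random placeholder data, not the paper's corpus.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_clips, n_features = 400, 24

# Hypothetical per-clip audio features and two annotation tracks
# (0 = neutral, 1 = aggressive).
X_audio = rng.normal(size=(n_clips, n_features))
y_multimodal = rng.integers(0, 2, size=n_clips)   # annotators saw video and heard audio
y_audio_only = rng.integers(0, 2, size=n_clips)   # annotators heard audio only

idx_train, idx_test = train_test_split(np.arange(n_clips),
                                       test_size=0.3, random_state=0)

results = {}
for name, y in [("multimodal labels", y_multimodal),
                ("audio-only labels", y_audio_only)]:
    # Train the audio classifier on one annotation track ...
    clf = SVC(kernel="rbf").fit(X_audio[idx_train], y[idx_train])
    # ... and evaluate it against the held-out portion of the same track.
    pred = clf.predict(X_audio[idx_test])
    results[name] = accuracy_score(y[idx_test], pred)

for name, acc in results.items():
    print(f"audio recognizer trained on {name}: accuracy = {acc:.2f}")
```

With the paper's real features and annotations in place of the placeholders, the same loop would reproduce the reported comparison between unimodal and multimodal ground truth.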
