Multimodal Alignment for Affective Content

Humans routinely extract important information from images and videos, guided by their gaze. In contrast, computational systems still have difficulty annotating important visual information in a human-like manner, in part because human gaze is often not included in the modeling process. Human input is also particularly relevant for processing and interpreting affective visual information. To address this challenge, we simultaneously captured human gaze, spoken language, and facial expressions in an experiment with visual stimuli characterized by subjective and affective content. Observers described the content of complex emotional images and videos depicting positive and negative scenarios, as well as their feelings about the imagery being viewed. We explore patterns across these modalities, for example by comparing the affective nature of participant-elicited linguistic tokens with image valence. Additionally, we expand a framework for generating automatic alignments between the gaze and spoken language modalities for visual annotation of images. Multimodal alignment is challenging because the modalities exhibit a varying temporal offset relative to one another. We explore how robust the alignment is when images have affective content and whether image valence influences alignment results. We also study whether word frequency-based filtering affects results: both the unfiltered and filtered scenarios perform better than baseline comparisons, and filtering yields a substantial decrease in alignment error rate. We provide visualizations of the resulting annotations from multimodal alignment. This work has implications for areas such as image understanding, media accessibility, and multimodal data fusion.

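To make the alignment idea concrete, the sketch below pairs each spoken token with the image region fixated shortly before the token was uttered, allowing for the varying temporal offset between gaze and speech, and includes a simple stand-in for word frequency-based filtering. This is a minimal illustration, not the framework's actual implementation: the data structures, the assumed eye-voice latency, the offset window, and the frequency cutoff are all hypothetical parameters.

```python
# Minimal sketch (illustrative assumptions, not the authors' implementation) of
# temporal alignment between gaze fixations and spoken tokens, plus optional
# word frequency-based filtering of tokens before alignment.

from dataclasses import dataclass
from collections import Counter

@dataclass
class Fixation:
    start: float   # seconds from trial onset
    end: float
    region: str    # label of the fixated image region

@dataclass
class Token:
    onset: float   # seconds from trial onset
    text: str

def align(fixations, tokens, lag=0.5, max_offset=2.0, freq_cutoff=None):
    """Pair each spoken token with the region fixated shortly before it was uttered.

    lag         -- assumed eye-voice latency: gaze tends to precede naming
    max_offset  -- discard pairs whose temporal distance exceeds this window
    freq_cutoff -- if set, drop tokens occurring more often than this count
                   (a stand-in for the word frequency-based filtering step)
    """
    if freq_cutoff is not None:
        counts = Counter(t.text.lower() for t in tokens)
        tokens = [t for t in tokens if counts[t.text.lower()] <= freq_cutoff]

    pairs = []
    for tok in tokens:
        target = tok.onset - lag  # look back by the assumed eye-voice latency
        best = min(
            fixations,
            key=lambda f: abs((f.start + f.end) / 2 - target),
            default=None,
        )
        if best and abs((best.start + best.end) / 2 - target) <= max_offset:
            pairs.append((tok.text, best.region))
    return pairs
```

The resulting (token, region) pairs can then serve as candidate region annotations, which is the kind of output visualized in the alignment results described above.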