Memotion Analysis through the Lens of Joint Embedding

Joint embedding (JE) is a way to encode multi-modal data into a common vector space in which text serves as the grounding key and other modalities, such as images, are anchored to those keys. A meme is typically an image with text embedded onto it. Although memes are commonly used for fun, they can also be used to spread hate and misinformation. This, along with their growing ubiquity across social platforms, has made automatic analysis of memes a widespread research topic. In this paper, we report our initial experiments on the Memotion Analysis problem through joint embeddings. Our results approach SOTA.

Text with Image Joint Embeddings

Distributional word embeddings such as Word2vec and GloVe are popular in modern NLP. Consider the king/queen word-vector analogy (Fig. 1a), which shows how well these embeddings capture syntactic and semantic regularities in language. CNN-based image embeddings, however, fail to capture such contextuality (Fig. 1b), since they depend only on the source image and do not capture corpus-level distributions. Our work focuses on the textual grounding of images, aiming for a distributional representation quite similar to word vectors.

Figure 1: CNN-based image embeddings are unable to capture contextuality like existing word embeddings do. (a) The king/queen analogy. (b) An expected joint embedding (JE) space: king_JE − queen_JE ≈ boy_JE − girl_JE.

Memotion Analysis Data and SOTA

We use the Memotion dataset (Sharma et al. 2020), consisting of 10K annotated memes with OCR-extracted text. Memotion analysis has three tasks. Task A: Sentiment Analysis (positive, negative, neutral). Task B: Emotion Analysis (sarcasm, humour, offense, motivation). Task C: Semantic Class Prediction of emotional sub-classes. The reported SOTA (F1) results on these tasks are 35.47%, 51.84%, and 32.35%, respectively (Sharma et al. 2020 report the details of SOTA).

Learning Joint Embedding

Learning text+image JEs has gained significant momentum recently. The majority of approaches are based on some form of Canonical Correlation Analysis (CCA), which finds similarities between two modalities by projecting them into one common vector space. We use two such existing models: i) CLIP (Radford et al. 2021) uses a vision transformer as the image encoder and a transformer as the text encoder to generate joint embeddings; it was pre-trained on a large dataset of over 400M image-text pairs. ii) Stanford's Joint Embedding (Kolluru 2019) (StanJE) uses VGG-19 and GloVe to generate the image and text encodings, respectively; it introduces a two-branch embedding network and a triplet loss function, which applies a margin-based penalty, to obtain the joint representation, and is pre-trained on the Flickr30k dataset (Plummer et al. 2016). Both models are trained with sentence-image pairs (SJM), so we re-train them with word-image pairs. Further, we fine-tune the models on Memotion + Flickr30k data to obtain the JE.

Word with Image Joint Embedding (WJM)

The smallest meaningful unit of language is the word, so anchored word representations alongside image encodings are a natural milestone for JE. We re-train both models with word-image pairs from Flickr30k and Memotion data. To keep only meaningful words, we retain just those belonging to specific part-of-speech (POS) categories: noun, verb, adverb, and adjective (a sketch of this pairing step follows below).
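For illustration, here is a minimal sketch of constructing word-image pairs under the POS filter described above. The choice of NLTK as the tagger, the tag-prefix mapping, and all identifiers are our assumptions; the paper does not specify the implementation.

```python
# A minimal sketch of building word-image training pairs by POS filtering.
# Assumption: the paper does not name a tagger; NLTK's Penn Treebank
# tagger is used here purely for illustration.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Penn Treebank tag prefixes for the four POS categories kept in the paper.
KEPT_TAG_PREFIXES = ("NN", "VB", "RB", "JJ")  # noun, verb, adverb, adjective

def word_image_pairs(image_id: str, caption: str):
    """Yield (image_id, word) pairs for content words only."""
    tokens = nltk.word_tokenize(caption.lower())
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith(KEPT_TAG_PREFIXES):
            yield (image_id, word)

# Example: one caption expands into several word-image pairs.
pairs = list(word_image_pairs("img_0042", "A young boy happily throws a red ball"))
# -> [('img_0042', 'young'), ('img_0042', 'boy'), ('img_0042', 'happily'), ...]
```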
This results in a many-fold increase in the dataset size, as every image is paired with multiple unique and significant words. Fig. 2 shows that the obtained JE space is able to capture a meaningful analogy: boy_JE − girl_JE ≈ man_JE − woman_JE.

Figure 2: Visualising the analogy: emulating the king/queen analogy, but instead using boy, girl, man, and woman; boy_JE − girl_JE ≈ man_JE − woman_JE.

Memotion Analysis using Joint Embeddings

Learning joint representations is a new research topic; likewise, how to use such learned JEs for downstream tasks is open territory for research exploration. CLIP allows zero-shot prediction, and StanJE provides a fine-tuning mechanism for applying trained JEs to a particular task. We find that neither method is adequate for the Memotion task, so we propose new ones. Task A is a multi-class problem, so we use a Multi-Layer Perceptron (MLP) with three dense layers and a softmax output layer, trained using categorical cross-entropy loss. For both Task B and Task C, we apply a Multi-Task Learning (MTL) framework, similar to (Samghabadi et al. 2020), which consists of four MLPs (one per sub-task). Binary cross-entropy loss is used for Task B and for the motivation emotion of Task C, since these are binary prediction problems; the rest of the sub-tasks in Task C are multi-class problems, so we use categorical cross-entropy loss. A sketch of these task heads is given below. The code is available at https://github.com/NethraGunti/MemotionAnalysis-Through-The-Lens-Of-Joint-Embedding
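For concreteness, the following is a minimal sketch of the task heads described above, written in PyTorch. The joint-embedding input dimension, hidden widths, and per-head class counts are illustrative assumptions; the repository linked above contains the actual implementation.

```python
# Minimal sketch of the classification heads described above, in PyTorch.
# Assumptions (not from the paper): JE input dimension of 512, hidden
# widths, and the exact class counts per Task C sub-task.
import torch
import torch.nn as nn

class TaskAHead(nn.Module):
    """Task A: MLP with 3 dense layers; softmax is applied via the loss."""
    def __init__(self, je_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(je_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 3),  # positive / negative / neutral
        )

    def forward(self, je):
        # nn.CrossEntropyLoss applies log-softmax internally.
        return self.net(je)

class MTLHeads(nn.Module):
    """Tasks B/C: four MLP heads (one per sub-task) over a shared JE."""
    def __init__(self, je_dim: int = 512, out_dims=(4, 4, 4, 1)):
        # out_dims illustrates Task C: three multi-class sub-tasks plus a
        # single-logit binary head for motivation (BCEWithLogitsLoss).
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(je_dim, 128), nn.ReLU(),
                          nn.Linear(128, d))
            for d in out_dims
        )

    def forward(self, je):
        return [head(je) for head in self.heads]

# Usage sketch: compute a loss per head and sum them for the MTL update.
je = torch.randn(8, 512)            # batch of joint embeddings
logits_a = TaskAHead()(je)          # shape (8, 3)
logits_bc = MTLHeads()(je)          # list of four logit tensors
```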