OPI at SemEval-2023 Task 1: Image-Text Embeddings and Multimodal Information Retrieval for Visual Word Sense Disambiguation

The goal of visual word sense disambiguation is to find the image that best matches the provided description of the word's meaning. It is a challenging problem, requiring approaches that combine language and image understanding. In this paper, we present our submission to SemEval 2023 visual word sense disambiguation shared task. The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches. We build a classifier based on the CLIP model, whose results are enriched with additional information retrieved from Wikipedia and lexical databases. Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.

[1]  Gabriel Ilharco,et al.  Reproducible Scaling Laws for Contrastive Language-Image Learning , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  F. Carlsson,et al.  Cross-lingual and Multilingual CLIP , 2022, LREC.

[3]  Jiecao Chen,et al.  WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning , 2021, SIGIR.

[4]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[5]  Radu Soricut,et al.  Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Ahmadreza Mosallanezhad,et al.  ParsiNLU: A Suite of Language Understanding Challenges for Persian , 2020, Transactions of the Association for Computational Linguistics.

[7]  Jörg Tiedemann,et al.  OPUS-MT – Building open translation services for the World , 2020, EAMT.

[8]  Iryna Gurevych,et al.  Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation , 2020, EMNLP.

[9]  Myle Ott,et al.  Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.

[10]  Prakhar Gupta,et al.  Learning Word Vectors for 157 Languages , 2018, LREC.

[11]  Francis Bond,et al.  Linking and Extending an Open Multilingual Wordnet , 2013, ACL.

[12]  Vicente Ordonez,et al.  Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[13]  Francis Bond,et al.  A Survey of WordNets and their Licenses , 2011 .

[14]  Christopher J. C. Burges,et al.  From RankNet to LambdaRank to LambdaMART: An Overview , 2010 .

[15]  Hugo Zaragoza,et al.  The Probabilistic Relevance Framework: BM25 and Beyond , 2009, Found. Trends Inf. Retr..

[16]  Antonio Toral,et al.  Rejuvenating the Italian WordNet: upgrading, standardising, extending , 2009 .

[17]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.