Multimodal Neural Databases

The rise of loosely structured data available as text, images, and other modalities has called for new ways of querying it. Multimedia Information Retrieval has filled this gap and has witnessed exciting progress in recent years. Tasks such as search and retrieval over extensive multimedia archives have undergone massive performance improvements, driven to a large extent by recent developments in multimodal deep learning. However, methods in this field remain limited in the kinds of queries they support; in particular, they cannot answer database-like queries. For this reason, inspired by recent work on neural databases, we propose a new framework, which we name Multimodal Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that involve reasoning over different input modalities, such as text and images, at scale. In this paper, we present the first architecture able to fulfill this set of requirements and evaluate it against several baselines, showing the limitations of currently available models. The results show the potential of these new techniques to process unstructured data coming from different modalities, paving the way for future research in the area. Code to replicate the experiments will be released at https://github.com/GiovanniTRA/MultimodalNeuralDatabases
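To make the notion of a database-like multimodal query concrete, below is a minimal illustrative sketch, not the paper's architecture: it scores a collection of images against a textual predicate with an off-the-shelf CLIP model and aggregates the matches into a COUNT, one of the simplest operators such a system would need to support. The checkpoint name, the similarity threshold, and the function `count_matching_images` are assumptions introduced here purely for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed off-the-shelf checkpoint; the paper's actual models may differ.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def count_matching_images(query: str, image_paths: list[str],
                          threshold: float = 25.0) -> int:
    """Toy COUNT operator: how many images satisfy a textual predicate?

    The fixed threshold on CLIP's scaled cosine similarity is a
    hypothetical stand-in for the retrieval and reasoning stages a
    real MMNDB would need.
    """
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (num_images, 1): similarity of
        # each image to the single text query.
        scores = model(**inputs).logits_per_image.squeeze(-1)
    return int((scores > threshold).sum().item())

# Usage (hypothetical files):
# count_matching_images("a photo of a red car", ["img1.jpg", "img2.jpg"])
```

A full MMNDB would presumably replace the fixed threshold with a learned retriever and support richer operators than COUNT, such as joins and aggregations over facts drawn from several modalities, which is precisely the gap the proposed framework targets.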
