论文信息 - Imparare a Quantificare Guardando (Learning to Quantify by Watching)

Imparare a Quantificare Guardando (Learning to Quantify by Watching)

English. In this paper, we focus on linguistic questions over images which may be answered with a quantifier (e.g. How many dogs are black? Some/most/all of them, etc.). We show that in order to learn to quantify, a multimodal model has to obtain a genuine understanding of linguistic and visual inputs and of their interaction. We propose a model that extracts a fuzzy representation of the set of the queried objects (e.g. dogs) and of the queried property in relation to that set (e.g. black with respect to dogs), outputting the appropriate quantifier for that relation. Italiano. In questo lavoro studiamo le domande del tipo “Quanti cani sono neri?”, la cui risposta è un quantificatore (es. “alcuni”/”tutti”/”nessuno”). Mostriamo che al fine di imparare a quantificare, un modello multimodale deve ottenere una rappresentazione profonda della domanda linguistica, dell’immagine e della loro interazione. Proponiamo un modello che estrae una rappresentazione approssimativa dell’insieme degli oggetti e della proprietà sui quali verte la domanda.

Sandro Pezzelle | Raffaella Bernardi | Aurélie Herbelot | Ionut Sorodoc

[1] Richard S. Zemel,et al. Image Question Answering: A Visual Semantic Embedding Model and a New Dataset , 2015, ArXiv.

[2] Georgiana Dinu,et al. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors , 2014, ACL.

[3] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[4] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[5] Andrea Vedaldi,et al. MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[6] Sarah-Jane Leslie,et al. Generics, Prevalence, and Default Inferences , 2009 .

[7] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[8] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[9] Yuandong Tian,et al. Simple Baseline for Visual Question Answering , 2015, ArXiv.

[10] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.