SpatialVOC2K: A Multilingual Dataset of Images with Annotations and Features for Spatial Relations between Objects

We present SpatialVOC2K, the first multilingual image dataset with spatial relation annotations and object features for image-to-text generation, built using 2,026 images from the PASCAL VOC2008 dataset. The dataset incorporates (i) the labelled object bounding boxes from VOC2008, (ii) geometrical, language and depth features for each object, and (iii) for each pair of objects in both orders, (a) the single best preposition and (b) the set of possible prepositions in the given language that describe the spatial relationship between the two objects. Compared to previous versions of the dataset, we have roughly doubled the size of the French portion, and have completely reannotated the English portion as well as increased its size, providing single best prepositions for English for the first time. Furthermore, we have added explicit 3D depth features for objects. We are releasing our dataset for free reuse, along with evaluation tools to enable comparative evaluation.
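
To make the annotation structure concrete, the sketch below shows how one image record of this kind might be represented. The field names and layout are illustrative assumptions for exposition only, not the released file format; consult the actual distribution for the exact schema.

```python
# Hypothetical sketch of a SpatialVOC2K-style record (assumed schema, not the
# official release format). Each image carries VOC2008 bounding boxes,
# per-object geometric/language/depth features, and, for each ordered pair of
# objects, the single best preposition plus the set of acceptable prepositions.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AnnotatedObject:
    label: str                        # VOC2008 object class, e.g. "person"
    bbox: Tuple[int, int, int, int]   # (xmin, ymin, xmax, ymax) in pixels
    depth: float                      # estimated 3D depth feature
    geometry: Dict[str, float]        # geometric features, e.g. area ratio
    word_vector: List[float]          # language feature, e.g. a word embedding

@dataclass
class SpatialRelation:
    trajector: int                    # index of the first (located) object
    landmark: int                     # index of the second (reference) object
    language: str                     # "en" or "fr"
    best_preposition: str             # single best preposition
    possible_prepositions: List[str]  # all prepositions judged acceptable

@dataclass
class ImageRecord:
    image_id: str                     # PASCAL VOC2008 image identifier
    objects: List[AnnotatedObject]
    relations: List[SpatialRelation] = field(default_factory=list)

# Example: one ordered object pair annotated in French (values are invented).
record = ImageRecord(
    image_id="2008_000123",
    objects=[
        AnnotatedObject("person", (48, 20, 210, 330), 3.1,
                        {"area_ratio": 0.21}, [0.12, -0.43]),
        AnnotatedObject("bicycle", (60, 150, 300, 360), 3.4,
                        {"area_ratio": 0.35}, [0.08, 0.27]),
    ],
    relations=[
        SpatialRelation(trajector=0, landmark=1, language="fr",
                        best_preposition="sur",
                        possible_prepositions=["sur", "au dessus de"]),
    ],
)
```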
