DIDEC: The Dutch Image Description and Eye-tracking Corpus

We present a corpus of spoken Dutch image descriptions, paired with two sets of eye-tracking data: free viewing, where participants look at images without any particular purpose, and description viewing, where we track eye movements while participants produce spoken descriptions of the images they are viewing. This paper describes the data collection procedure and the corpus itself, and provides an initial analysis of self-corrections in image descriptions. We also present two studies showing the potential of this data. Though these studies mainly serve as examples, we find two interesting results: (1) the eye-tracking data for the description-viewing task is more coherent than for the free-viewing task; (2) variation in image descriptions (also called image specificity; Jas and Parikh, 2015) is only moderately correlated across different languages. Our corpus can be used to gain a deeper understanding of the image description task, particularly how visual attention is correlated with the image description process.
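To make the specificity result concrete, the sketch below computes image specificity in the spirit of Jas and Parikh (2015): an image's specificity is the mean pairwise similarity of the human descriptions collected for it, so images whose descriptions agree score high. The bag-of-words cosine similarity, the function names, and the example sentences are all illustrative assumptions made here for self-containment; the original work uses richer sentence-similarity measures.

```python
# Minimal sketch of image specificity (Jas and Parikh, 2015): the mean
# pairwise similarity over all human descriptions of one image.
# Assumption: similarity is a simple bag-of-words cosine; the original
# work uses more sophisticated sentence-similarity measures.
from collections import Counter
from itertools import combinations
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def specificity(descriptions: list[str]) -> float:
    """Mean pairwise similarity over all descriptions of one image."""
    bags = [Counter(d.lower().split()) for d in descriptions]
    pairs = list(combinations(bags, 2))
    if not pairs:  # need at least two descriptions per image
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical descriptions of the same image in two languages.
dutch = ["een man fietst door de stad",
         "een fietser rijdt door de straat",
         "een man op een fiets in de stad"]
english = ["a man cycling through the city",
           "a person rides a bicycle down the street",
           "a cyclist on a busy road"]

print(specificity(dutch), specificity(english))
```

To check the cross-language claim, one would compute this score for every image in each language and correlate the two resulting score vectors, for instance with scipy.stats.spearmanr; a moderate coefficient would mirror the finding above.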

References

[1] Peter Young et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.

[2] Ali Borji et al. State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[3] Roser Morante et al. Pragmatic Factors in Image Description: The Case of Negations. VL@ACL, 2016.

[4] Emiel Krahmer et al. Varying image description tasks: spoken versus written descriptions. VarDial@COLING, 2018.

[5] Qi Zhao et al. SALICON: Saliency in Context. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[6] A. Maes et al. Talking about Relations: Factors Influencing the Production of Relational Descriptions. Frontiers in Psychology, 2016.

[7] Nicola Guarino et al. Social Roles and their Descriptions. KR, 2004.

[8] Albert Gatt et al. Reference Production as Search: The Impact of Domain Size on the Production of Distinguishing Descriptions. Cognitive Science, 2017.

[9] Piek Vossen et al. Open Dutch WordNet. GWC, 2016.

[10] A. L. Yarbus et al. Eye Movements and Vision. Springer US, 1967.

[11] Samy Bengio et al. Show and tell: A neural image caption generator. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[12] W. Levelt. Speaking: From Intention to Articulation, 1990.

[13] Desmond Elliott et al. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. WMT, 2017.

[14] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[15] C. Koch et al. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 2000.

[16] Nazli Ikizler-Cinbis et al. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. Journal of Artificial Intelligence Research, 2016.

[17] M. Brysbaert et al. Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation, 2017.

[18] Moreno I. Coco et al. Scan Patterns Predict Sentence Production in the Cross-Modal Processing of Visual Scenes. Cognitive Science, 2012.

[19] Paul Boersma et al. Praat: doing phonetics by computer, 2003.

[20] Moreno I. Coco et al. Classification of visual and linguistic tasks using eye-movement features. Journal of Vision, 2014.

[21] Pietro Perona et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.

[22] Michael S. Bernstein et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 2016.

[23] Khalil Sima'an et al. Multi30K: Multilingual English-German Image Descriptions. VL@ACL, 2016.

[24] Devi Parikh et al. Image specificity. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[25] Richard Socher et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[26] Yoshua Bengio et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML, 2015.

[27] Jianxiong Xiao et al. What makes an image memorable?, 2011.

[28] Victoria A. Fromkin. The Non-Anomalous Nature of Anomalous Utterances, 1971.

[29] Thiago Castro Ferreira et al. Task demands and individual variation in referring expressions. INLG, 2016.

[30] Christiane Fellbaum et al. Book Reviews: WordNet: An Electronic Lexical Database. Computational Linguistics, 1999.

[31] W. Levelt et al. Monitoring and self-repair in speech. Cognition, 1983.

[32] R. C. Langford. How People Look at Pictures: A Study of the Psychology of Perception in Art, 1936.

[33] Erhard W. Hinrichs et al. GernEdiT - The GermaNet Editing Tool. LREC, 2010.

[34] Frédo Durand et al. What Do Different Evaluation Metrics Tell Us About Saliency Models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[35] G.H.J. Drieman. Differences between written and spoken language: An exploratory study, 1962.

[36] Helmut Feldweg et al. GermaNet - a Lexical-Semantic Net for German, 1997.

[37] A. Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 2013.

[38] Tomas Mikolov et al. Enriching Word Vectors with Subword Information. TACL, 2016.

[39] Piek T. J. M. Vossen et al. Cross-linguistic differences and similarities in image descriptions. INLG, 2017.