Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation in which visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention, and they shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned with those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.
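
The abstract's core architectural idea can be illustrated with a minimal sketch: encode the temporally ordered fixation sequence with its own recurrent component and fuse the result with image features that condition a caption decoder. The PyTorch module below is an illustrative assumption, not the authors' released implementation; all class names, feature dimensions, and the simple additive fusion are hypothetical choices made only to show the general shape of sequential gaze encoding.

```python
# Illustrative sketch only (not the paper's code): a gaze sequence is encoded
# with a dedicated recurrent component and combined with global image features
# to condition a caption decoder. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class GazeSequenceCaptioner(nn.Module):
    def __init__(self, vocab_size, gaze_dim=2, img_dim=2048,
                 hidden_dim=512, embed_dim=300):
        super().__init__()
        # Recurrent encoder over temporally ordered fixation features
        # (e.g. x/y coordinates), modelling gaze sequentially rather than
        # as a static saliency map.
        self.gaze_encoder = nn.LSTM(gaze_dim, hidden_dim, batch_first=True)
        # Project a global image feature vector (e.g. a CNN pooled feature)
        # into the same space as the gaze summary.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        # Standard LSTM language model conditioned on the fused context.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim + hidden_dim, hidden_dim,
                               batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, gaze_seq, img_feats, captions):
        # gaze_seq:  (B, T_gaze, gaze_dim)  ordered fixations
        # img_feats: (B, img_dim)           global visual features
        # captions:  (B, T_cap)             token ids (teacher forcing)
        _, (gaze_h, _) = self.gaze_encoder(gaze_seq)        # (1, B, H)
        context = gaze_h[-1] + self.img_proj(img_feats)     # (B, H)
        words = self.embed(captions)                        # (B, T_cap, E)
        ctx = context.unsqueeze(1).expand(-1, words.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([words, ctx], dim=-1))
        return self.out(dec_out)                            # (B, T_cap, V)
```

In this sketch the final hidden state of the gaze LSTM summarises the scan pattern; richer variants could instead attend over the per-fixation hidden states at each decoding step.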
