Computational framework for fusing eye movements and spoken narratives for image annotation

Despite many recent advances in the field of computer vision, a disconnect remains between how computers process images and how humans understand them. To begin to bridge this gap, we propose a framework that integrates human-elicited gaze and spoken language to label perceptually important regions in an image. Our work relies on the notion that gaze and spoken narratives can jointly model how humans inspect and analyze images. Using an unsupervised bitext alignment algorithm originally developed for machine translation, we create meaningful mappings between participants’ eye movements over an image and their spoken descriptions of that image. The resulting multimodal alignments are then used to annotate image regions with linguistic labels. The accuracy of these labels exceeds that of baseline alignments obtained using purely temporal correspondence between fixations and words. We also find that system performance differs when image regions are identified using clustering methods that rely on gaze information rather than on image features. The alignments produced by our framework can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. The framework can potentially be applied to any multimodal data stream and to any visual domain. To this end, we provide the research community with access to the computational framework.
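The alignment step described above can be sketched in miniature. Suppose each trial yields a "bitext" pair: a sequence of gaze-region tokens (fixation clusters, here labeled `R1`, `R2`, ... for illustration) and the word sequence of the matching spoken narrative. An IBM Model 1-style EM procedure, the classic unsupervised bitext alignment approach from statistical machine translation, can then estimate word-region translation probabilities and align each word to its most probable region. This is a minimal sketch under those assumptions, not the paper's actual implementation; all function names and toy labels are hypothetical.

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=15):
    """Estimate t(word | region) with EM, in the style of IBM Model 1.

    bitext: list of (region_tokens, words) pairs, one per trial.
    Returns a dict-like table t[(word, region)] of translation probabilities.
    """
    words = {w for _, ws in bitext for w in ws}
    uniform = 1.0 / len(words)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per region
        for regions, ws in bitext:
            for w in ws:
                z = sum(t[(w, r)] for r in regions)  # P(w | sentence)
                for r in regions:
                    c = t[(w, r)] / z  # soft alignment count
                    count[(w, r)] += c
                    total[r] += c
        t = defaultdict(lambda: uniform,
                        {(w, r): count[(w, r)] / total[r] for (w, r) in count})
    return t

def align(regions, ws, t):
    """Label each narrative word with its most probable gaze region."""
    return [(w, max(regions, key=lambda r: t[(w, r)])) for w in ws]
```

Given a toy corpus such as `[(["R1", "R2"], ["dog", "grass"]), (["R1", "R3"], ["dog", "sky"]), (["R2", "R3"], ["grass", "sky"])]`, EM concentrates probability mass on the region each word consistently co-occurs with, so `align` maps "dog" to `R1`, "grass" to `R2`, and "sky" to `R3`. The purely temporal baseline mentioned above would instead pair words and fixations by timestamp alone, with no such distributional learning.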
