Fusing eye movements and observer narratives for expert-driven image-region annotations

Human image understanding is reflected in individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of these multimodal representations remain a challenge. In this paper, we expand a framework for capturing image-region annotations in dermatology, a domain in which image interpretation is influenced by experts' visual perception skills, conceptual domain knowledge, and task-oriented goals. Our work explores the hypothesis that eye movements can help us understand experts' perceptual processes and that spoken language descriptions can reveal conceptual elements of image inspection tasks. We cast the problem of meaningfully integrating visual and linguistic data as unsupervised bitext alignment. Using alignment, we create meaningful mappings between physicians' eye movements, which reveal key areas of images, and their spoken descriptions of those images. The resulting alignments are then used to annotate image regions with medical concept labels. Our alignment accuracy exceeds that of baselines based on both exact and delayed temporal correspondence. Additionally, a comparison of alignment accuracy between a method that identifies image clusters from eye movements and one that identifies clusters from image features suggests that the two approaches perform well on different types of images and concept labels; an image annotation framework should therefore integrate information from more than one technique to handle heterogeneous images. We also investigate the performance of the proposed aligner for dermatological primary-morphology concept labels, as well as for categories of images based on lesion size, type, and distribution.
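To make the bitext-alignment formulation concrete, the sketch below shows a minimal IBM Model 1-style EM aligner in Python, treating each image's sequence of fixation-cluster IDs as one side of the bitext and the tokenized narrative as the other. This is an illustrative simplification only: the cluster IDs (c1-c3) and narrative tokens in the toy data are hypothetical, and the paper's actual unsupervised aligner is more sophisticated than this single lexical model.

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Estimate lexical translation probabilities t(word | region) with EM.

    bitext: list of (regions, words) pairs, where `regions` is a sequence
    of fixation-cluster IDs (the "visual sentence") and `words` is the
    tokenized spoken description of the same image.
    """
    # Uniform initialization; any constant works because the E-step normalizes.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts c(word, region)
        total = defaultdict(float)  # expected marginal counts c(region)
        for regions, words in bitext:
            for w in words:
                norm = sum(t[(w, r)] for r in regions)
                for r in regions:
                    delta = t[(w, r)] / norm  # fractional alignment weight
                    count[(w, r)] += delta
                    total[r] += delta
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {pair: c / total[pair[1]] for pair, c in count.items()})
    return t

def align(regions, words, t):
    """Link each narrative token to its most probable fixation cluster."""
    return [(w, max(regions, key=lambda r: t[(w, r)])) for w in words]

# Toy bitext with hypothetical cluster IDs and narrative tokens.
bitext = [
    (["c1", "c2"], ["erythematous", "plaque"]),
    (["c2", "c3"], ["plaque", "scale"]),
    (["c1", "c3"], ["erythematous", "scale"]),
]
t = ibm_model1(bitext, iterations=20)
print(align(["c1", "c2"], ["erythematous", "plaque"], t))
# Typically yields [('erythematous', 'c1'), ('plaque', 'c2')]
```

On this toy bitext, EM concentrates the t(word | region) mass on consistently co-occurring pairs, so each concept word settles on the cluster it co-occurs with most often; real physician narratives would additionally require handling disfluencies, multi-word medical concepts, and the delayed temporal correspondence between gaze and speech discussed above.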
