A computational model to connect gestalt perception and natural language

We present a computational model that connects gestalt visual perception and language. The model grounds the meaning of natural language words and phrases in the perceptual properties of visually salient groups. We focus on the semantics of a class of words that we call conceptual aggregates (e.g., pair, group, stuff), which inherently refer to groups of objects. The model explains how the semantics of these terms interact with gestalt processes to connect referring expressions to visual groups. The model has two stages. The first stage, grouping, takes a visual scene segmented into block objects as input, creates a space of possible salient groups arising from the scene, and assigns a saliency score to each group. The second stage, visual grounding, takes this space of salient groups together with a linguistic scene description and finds the best match between the description and a set of objects. Parameters of the model are trained on data observed in a linguistic description and visual selection task. The model has been implemented as a program that takes a synthetic visual scene and a linguistic description as input and identifies likely groups of objects within the scene that correspond to the description. We evaluate the performance of the model on a visual referent identification task. The model may be applied in natural language understanding and generation systems that use visual context, such as scene description systems for the visually impaired and functionally illiterate.

Thesis Supervisor: Deb K. Roy
Title: Assistant Professor of Media Arts and Sciences
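
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: objects are reduced to 2D points, grouping uses proximity only (single-link clustering at several distance thresholds), saliency is the fraction of thresholds at which a group appears, and the `SIZE_FIT` table hand-codes size preferences for two conceptual aggregates, where the actual model trains its parameters from data. The names `candidate_groups`, `ground`, and `SIZE_FIT` are hypothetical.

```python
from itertools import combinations
from math import dist


def candidate_groups(points, thresholds):
    """Stage 1 (sketch): proximity grouping at several distance thresholds.

    Returns {frozenset of object indices: saliency}, where saliency is the
    fraction of thresholds at which that exact group appears (an assumption).
    """
    counts = {}
    for t in thresholds:
        # Union-find: link every pair of objects no farther apart than t.
        parent = list(range(len(points)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in combinations(range(len(points)), 2):
            if dist(points[i], points[j]) <= t:
                parent[find(i)] = find(j)

        # Collect connected components with at least two members.
        clusters = {}
        for i in range(len(points)):
            clusters.setdefault(find(i), set()).add(i)
        for members in clusters.values():
            if len(members) >= 2:
                key = frozenset(members)
                counts[key] = counts.get(key, 0) + 1
    return {g: c / len(thresholds) for g, c in counts.items()}


# Hypothetical, hand-coded size preferences for two conceptual aggregates.
SIZE_FIT = {
    "pair": lambda n: 1.0 if n == 2 else 0.0,
    "group": lambda n: 1.0 if n >= 3 else 0.0,
}


def ground(word, groups):
    """Stage 2 (sketch): pick the group maximizing saliency * word fit."""
    scored = {g: s * SIZE_FIT[word](len(g)) for g, s in groups.items()}
    if not scored:
        return None
    best = max(scored, key=scored.get)
    return best if scored[best] > 0 else None


# Toy scene: two objects on the left, three clustered on the right.
scene = [(0, 0), (1, 0), (10, 0), (10.5, 0), (11, 0.5)]
groups = candidate_groups(scene, thresholds=[1.0, 2.0, 5.0, 20.0])
pair_referent = ground("pair", groups)
group_referent = ground("group", groups)
```

In this toy scene, "pair" resolves to the two left-hand objects and "group" to the three right-hand ones; the group containing all five objects only forms at the largest threshold, so it receives low saliency under this scoring.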
