Active Observer Visual Problem-Solving Methods are Dynamically Hypothesized, Deployed and Tested

The STAR architecture was designed to test the value of the full Selective Tuning model of visual attention for complex real-world visuospatial tasks and behaviors. However, knowledge of how humans solve such tasks in 3D as active observers is lean. We thus devised a novel experimental setup and examined such behavior. We discovered that humans exhibit a variety of problem-solving strategies whose breadth and complexity are surprising and not easily handled by current methodologies. It is apparent that solution methods are dynamically composed by hypothesizing sequences of actions, testing them and, if they fail, trying different ones. The importance of active observation is striking, as is the lack of any learning effect. These results inform our Cognitive Program representation of STAR, extending its relevance to real-world tasks.

1. Visuospatial Tasks

The widely acknowledged success of modern computer vision, coupled with assertions that its methodologies and performance are human-like, may lead one to think that vision is a solved problem. The reality is far from this. The breadth of visuospatial tasks humans perform daily is stunning (Carroll 1996), and most have barely been considered by the computational community. One aspect of human vision that is insufficiently examined in computer vision, as well as in cognitive architectures (CAs) (review in Kotseruba & Tsotsos 2019), is visual attention. Of course, visual attention is widely acknowledged as important, and many theoretical and practical investigations have appeared. However, they all seem limited in that they include only the most obvious and basic of attentional mechanisms, rarely involving a fixation control component. Further, they are often accompanied by claims of biological correspondence or similarity (e.g., Bengio 2019) when, in fact, there is a significant chasm between human attentional abilities and what current computer vision systems include (for background, see Itti, Rees & Tsotsos 2005, Tsotsos 2011, Carrasco 2011, Nobre & Kastner 2014, Moore & Zirnsak 2017, Goodhew 2020).

The STAR (Selective Tuning Attention Reference) architecture is based on this assertion: if the Selective Tuning (ST) model of attention actually represents the set of brain mechanisms of attention, then ST should provide all the attentional support a behaving brain (agent) would require. ST includes a broad and comprehensive set of mechanisms, many of them originally unintuitive but now with significant experimental support.¹ In order to test the assertion, the full set of other functional components, as also found in most other CAs, must be connected to ST (Tsotsos & Kruijne 2014). This set includes fixation control in 3D space, something common neither among CAs nor in computer vision. STAR also does not assume that observers are passive, that is, that they are simply receivers of externally determined input and play no role in selecting what, why, when, where and how to observe their external world (Bajcsy et al. 2018). However, the study of human visual behavior as an active observer has been limited. As a result, any potential testbed for STAR seems unavailable, as are even particular exemplar problems with which to test our ideas. The comprehensive description of human visuospatial abilities in Chapter 8 of Carroll (1996) proved to be very helpful.

¹ The ideas behind ST appeared in Tsotsos (1988, 1990) and Tsotsos et al. (1995). For a full description, see Tsotsos (2011).
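The abstract's central claim, that solution methods are hypothesized, deployed and tested rather than retrieved ready-made, can be made concrete with a minimal control-loop sketch. This is purely illustrative and is not STAR's actual Cognitive Program machinery; every name in it (Method, propose_methods, execute, succeeded) is a hypothetical placeholder.

```python
# A minimal, illustrative hypothesize-deploy-test loop for an active observer.
# All names are hypothetical placeholders, not STAR or Selective Tuning APIs.
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class Method:
    """A candidate solution method: a named sequence of observer actions."""
    name: str
    actions: List[str] = field(default_factory=list)  # e.g., ["fixate corner", "step left", "compare"]

def solve(task_state: dict,
          propose_methods: Iterable[Method],
          execute: Callable[[Method, dict], None],
          succeeded: Callable[[dict], bool]) -> str:
    """Try candidate methods until one succeeds or the supply runs out."""
    for method in propose_methods:      # hypothesize a sequence of actions
        execute(method, task_state)     # deploy it: move, fixate, observe
        if succeeded(task_state):       # test the outcome
            return method.name
    return "unsolved"                   # no composed method worked
```

On this reading, the lack of a learning effect noted in the abstract suggests that proposal and testing happen anew on each trial rather than converging on a single cached policy.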
We chose as a first exemplar the same-different task, a task humans need to solve often and one that seems a basic component of many other tasks. Deciding whether two objects are the same or different may seem straightforward. Often we design objects to be easily discriminable, say by colour, size or pattern, but this is not always the case. Consider an assembly task (e.g., furniture assembly) where you are given a part and need to go to a bin of parts in order to find another one of the same kind. Playing with interlocking toy blocks requires one to perform such tasks many times while constructing a block configuration, whether copying from a plan, mimicking an existing configuration or building from one's imagination. There are many more examples. Obvious instances of this problem are not effective as probes into human solutions, because humans are remarkable in their ability to home in on a workable strategy that can be used for most instances. We thus needed to push an experimental design to the extreme in order to discover the characteristics and limitations of the human solution space.

The key question remains: what is the sequence of actions to correctly determine whether two objects are the same? This problem is of equal interest for human behavior and for robot behavior. In the current AI and computer vision community, one's first approach might be to learn solutions. It is quite likely possible to learn a viewing policy that simply covers all parts of a viewing sphere around each object and then compares the feature representations on the sphere surface, as sketched below. But this obvious, brute-force solution does not illuminate how humans do this far more efficiently: we solve simple cases quickly, take an increasing number of views with increasing task difficulty, rarely need to see a full spherical view, and almost never take a single complete set of views. However, such conclusions seem subjective, and no experimental evidence is available that explicates how humans solve such tasks.
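To fix ideas, here is a minimal sketch of that brute-force baseline, assuming hypothetical render and extract_features helpers standing in for any renderer and any learned or hand-crafted descriptor (camera roll is ignored for simplicity):

```python
# Brute-force same-different baseline: cover a viewing sphere around each
# object, featurize every view, and compare the two view sets exhaustively.
# render() and extract_features() are hypothetical stand-ins.
import numpy as np

def sphere_viewpoints(n: int, radius: float = 1.0) -> np.ndarray:
    """Roughly uniform viewpoints via a Fibonacci lattice on a sphere."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i        # golden-angle longitude steps
    z = 1.0 - 2.0 * (i + 0.5) / n                 # evenly spaced heights
    r = np.sqrt(1.0 - z * z)
    return radius * np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def same_object(obj_a, obj_b, render, extract_features,
                n_views: int = 64, tol: float = 1e-3) -> bool:
    """Every view of A must match some view of B and vice versa, since the
    two objects may sit in different, unknown orientations."""
    views = sphere_viewpoints(n_views)
    feats_a = np.stack([extract_features(render(obj_a, v)) for v in views])
    feats_b = np.stack([extract_features(render(obj_b, v)) for v in views])
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    return bool(dists.min(axis=1).max() < tol and dists.min(axis=0).max() < tol)
```

Even at modest resolution this costs n_views renders per object and O(n_views²) feature comparisons on every trial, regardless of how easy the pair is; the human behavior described above pays nothing like that price.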
Related visuospatial tasks have also been studied by the cognitive neuroscience community and, as shown in Tsotsos et al. (2021), that community seems to be converging on the importance of flexible composition of elements to achieve solutions under dynamic task presentation. This is consistent with Tsotsos (2011), namely that vision seems to involve a general-purpose processor that can be dynamically tuned to the task and input at hand, and that attention is a set of mechanisms that tune and control the search processes inherent in perception and cognition. However, dynamic tuning cannot occur without an explicit executive controller, as argued in Tsotsos et al. (2021). To test this view and confirm consistency with the neuroscience view of flexible composition, data is required. The experiment described here may be helpful.

The remainder of this paper proceeds as follows. The next section begins with a brief description of the general idea of the experiment, followed by a summary of the experimental environment and setup, the stimulus set, and a small sampling of our results and observations. Section 3 introduces Cognitive Programs, which promise sufficient power and flexibility for the representation of active observer behavior. A summary of results concludes the paper.

2. Examining the Same-Different Task

The classic instance of the same-different task is widely known from the work of Shepard & Metzler (1971). There, they used objects formed by concatenations of cubes, depicted as black line perspective drawings on a white background. Subjects were shown pairs of these objects and asked whether the objects were the same or different. Stimuli were 4-5 cm in linear extent, presented in two windows and viewed from 60 cm, subtending roughly 4° to 5° of visual angle. In other words, subjects were passive viewers with a constant target visual angle for each stimulus object. The "view" was pre-determined. Since reaction times were as long as 5 s, there was plenty of time for eye movements, but no report of them was provided. Results showed that subjects mentally rotated one object into the other, an inference drawn from response times. However, taking a step backward,

[1] John K. Tsotsos, et al. A Focus on Selection for Fixation, 2016.

[2] John K. Tsotsos, et al. PESAO: Psychophysical Experimental Setup for Active Observers, 2020, ArXiv.

[3] Yoshua Bengio, et al. The Consciousness Prior, 2017, ArXiv.

[4] R. Shepard, et al. Mental Rotation of Three-Dimensional Objects, 1971, Science.

[5] John K. Tsotsos, et al. STAR-RT: Visual attention for real-time video game playing, 2017, ArXiv.

[6] Michael C. Pyryt. Human cognitive abilities: A survey of factor analytic studies, 1998.

[7] Neal F. Johnson, et al. The Role of Chunking and Organization in the Process of Recall, 1970.

[8] M. Carrasco. Visual attention: The past 25 years, 2011, Vision Research.

[9] John K. Tsotsos, et al. On the control of attentional processes in vision, 2021, Cortex.

[10] John K. Tsotsos, et al. Tracking Active Observers in 3D Visuo-Cognitive Tasks, 2021, ETRA Adjunct.

[11] John K. Tsotsos, et al. Feed-forward visual processing suffices for coarse localization but fine-grained localization in an attention-demanding context needs feedback processing, 2019, PLoS ONE.

[12] John K. Tsotsos, et al. Revisiting active perception, 2016, Autonomous Robots.

[13] John K. Tsotsos. A Computational Perspective on Visual Attention, 2011.

[14] Dino Pedreschi, et al. Trajectory pattern mining, 2007, KDD '07.

[15] Sabine Kastner, et al. The Oxford Handbook of Attention, 2014.

[16] John K. Tsotsos, et al. Active Fixation Control to Predict Saccade Sequences, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17] Christof Koch, et al. What Is Consciousness?, 2018, Nature.

[18] Dileep George, et al. Beyond imitation: Zero-shot task transfer on robots by learning concepts as cognitive programs, 2018, Science Robotics.

[19] John K. Tsotsos, et al. Cognitive programs: software for attention's executive, 2014, Front. Psychol.

[20] Toni Kunic, et al. Cognitive Program Compiler, 2017.

[21] Nils J. Nilsson. Shakey the Robot, 1984.

[22] John K. Tsotsos, et al. Modeling Visual Attention via Selective Tuning, 1995, Artif. Intell.

[23] John K. Tsotsos, et al. Neurobiology of Attention, 2005.

[24] John K. Tsotsos, et al. 40 years of cognitive architectures: core cognitive abilities and practical applications, 2018, Artificial Intelligence Review.

[25] T. Moore, et al. Neural Mechanisms of Selective Visual Attention, 2017, Annual Review of Psychology.

[26] John K. Tsotsos, et al. Blocks World Revisited: The Effect of Self-Occlusion on Classification by Convolutional Neural Networks, 2021, IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[27] Burr Settles. Active Learning Literature Survey, 2009.

[28] S. Ullman. Visual routines, 1984, Cognition.

[29] S. Goodhew. The Breadth of Visual Attention, 2020.

[30] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[31] Zhihui Li, et al. A Survey of Deep Active Learning, 2020, ACM Comput. Surv.

[32] R. Desimone, et al. Neural mechanisms of selective visual attention, 1995, Annual Review of Neuroscience.

[33] John K. Tsotsos. Analyzing vision at the complexity level, 1990, Behavioral and Brain Sciences.