Abstract

We present a continuous learning framework for learning simple visual concepts and its implementation in an artificial cognitive system. The main goal is to learn associations between automatically extracted visual features and words that describe the scene in an open-ended, continuous manner. In particular, we address the problem of cross-modal learning of elementary visual properties and spatial relations; we show that the same learning mechanism can be used to learn both types of concepts. We introduce and analyse several learning modes requiring different levels of tutor supervision, ranging from a completely tutor-driven to a completely autonomous exploratory approach.

1 Introduction

In the real world, a cognitive system should possess the ability to learn and adapt in a continuous, open-ended, life-long fashion in an ever-changing environment. As an example of such a learning framework, we need look no further than the successful application of continuous learning in human beings. As humans, we first learn a new visual concept (e.g., an object category, an object property, an action pattern, an object affordance, etc.) by encountering a few examples of one. Later, as we come across more instances different to the original examples, we not only recognise them, but also update our representation of learned visual concepts, based on the salient properties of the new examples and without having visual access to the previous examples.
In this way, we update or enlarge our ontology in an efficient and structured way by encapsulating new information extracted from the perceived data, which enables adaptation to new visual inputs and the handling of novel situations we may encounter.

While the primary focus of this idea is on the incremental nature of the knowledge update, another key aspect should be noted: the scrutiny of the various visual features and the determination of which features are useful for representing the chief visual attributes of the object or scene in question. Since a continuous learning framework would not retain complete data from previously learned samples, it would not have the luxury of being able to reference specific details across multiple samples in order to learn. Given this restriction, continuous learning lends itself to an abstract multi-modal system involving interaction with a user.

In this paper we present a framework for learning simple visual concepts that addresses the premises mentioned above. The main goal is to learn associations between automatically extracted visual features and words describing the scene in an open-ended, continuous manner. The continuous and multimodal nature of the problem demands careful system design. Our implemented system is composed of vision and communication subsystems providing the visual input and enabling verbal dialogue with a tutor. Such a multi-faceted active system provides the means for efficient communication, facilitating user-friendly and continuous cross-modal learning.

In particular, we address the problem of learning visual properties (such as colour or shape) and spatial relations (such as ‘to the left of’ or ‘far away’). The main goal is to find associations between words describing these concepts and simple visual features extracted from the images. This symbol grounding problem