Object Action Complexes as an Interface for Planning and Robot Control

Much prior work in integrating high-level artificial intelligence planning technology with low-level robotic control has foundered on the significant representational differences between these two areas of research. We discuss a proposed solution to this representational discontinuity in the form of object-action complexes (OACs). The pairing of actions and objects in a single interface representation captures the needs of both reasoning levels, and will enable machine learning of high-level action representations from low-level control representations.

I. Introduction and Background

The difference between the representations that are effective for continuous control of robotic systems and the discrete symbolic representations used in AI presents a significant challenge for integrating AI planning research and robotics. These areas of research should be able to inform one another; in practice, however, many collaborations have foundered on the representational differences. In this paper, we propose the use of object-action complexes [1] to address the representational difference between these reasoning components.

The representations used in the robotics community can generally be characterized as vectors of continuous values. These vectors may be used to represent absolute points in three-dimensional space, relative points in space, joint angles, force vectors, and even world-level properties that require real-valued models [2]. Such representations allow system builders to succinctly specify robot behavior, since most, if not all, of the computations for robotic control are effectively captured as continuous transforms of continuous vectors over time. AI representations, on the other hand, have focused on discrete symbolic representations of objects and actions, usually using propositional or first-order logics. Such representations typically focus on modeling the high-level conceptual state changes that result from action execution, rather than the low-level continuous details of action execution.

Neither representational system alone covers the requirements for controlling deliberate action; however, both levels seem to be required to produce human-level behavioral control. Our objective is to propose an interface representation that will allow both the effective exchange of information between these two levels and the learning of high-level action representations on the basis of the information provided by the robotic control system. Any such representation must provide clear semantics and be easily manipulable at both levels. Further, it must leverage the respective strengths of the two representation levels. In particular, the robotic control system's access to the actual physical state of the world through its sensors and effectors is essential to learning the actions the planning system must reason about. Each low-level action executed by the robot offers the opportunity to observe a small instantiated fragment of the state transition function that the AI action representations must capture. Therefore, we propose that the robotic control system provide fully instantiated fragments of the planning domain's state transition function, captured during low-level execution, to the high-level AI system to enable the learning of abstract action representations. We will call such a fragment an instantiated state transition fragment (ISTF), and define it to be a situated pairing of an object and an action that captures a small, but fully instantiated, fragment of the planning domain's state transition function.
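To make the interface concrete, an ISTF can be pictured as a record that pairs the object acted on, the action (motor program) executed, and the fully instantiated state observed before and after execution. The following minimal Python sketch is our own illustration of that idea; the class name and fields are assumptions introduced for clarity, not a formal definition from this paper.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class ISTF:
    """Instantiated state transition fragment: a situated pairing of an object
    and an action, together with the grounded pre- and post-states observed
    during low-level execution (illustrative sketch only)."""
    obj: str                    # identifier of the object acted upon, e.g. "a1"
    action: str                 # motor program / action label, e.g. "grasp-reflex-A"
    pre_state: Dict[str, Any]   # fully instantiated facts observed before execution
    post_state: Dict[str, Any]  # fully instantiated facts observed after execution

# Example: one successful grasp observed by the low-level controller.
fragment = ISTF(
    obj="a1",
    action="grasp-reflex-A",
    pre_state={"gripper_empty": True, "on_table(a1)": True},
    post_state={"gripper_empty": False, "holding(a1)": True},
)
```

Repeated, reproducible fragments of this kind are the raw material from which the higher-level generalizations (OACs) discussed next would be learned.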
The process of learning domain invariants from repeated, reproducible instances of very similar ISTFs will result in generalizations over such instances that we will call object-action complexes (OACs). To see how this is done, the rest of this paper will first discuss a detailed view of a robot control system, and then an AI planning-level description of the same domain. We will then more formally define ISTFs and OACs, show how ISTFs can be produced by the robot control system, and show how OACs relate to the AI planning-level description. Finally, we will discuss the learning of OACs on the basis of ISTFs.

To do all this, we require a particular domain for the robot to interact with. Imagine the relatively standard but simple robot control scenario illustrated in Figure 1. It consists of an arm with a gripper and a table with two light-colored cubes and one dark-colored cube. The robot has the task of placing the cubes into a box, also located on the table. We will also assume the robot is provided with a camera to view the objects in the domain. However, at the initial stage, the system does not have any knowledge of those objects. The only initial world knowledge available to the system is provided by the vision module and the hard-coded action reflexes that this visual input can elicit.

Fig. 1. Illustration of how object classes are discovered from basic uninformed reflex actions.

II. Vision-Based Reflexes Driving the Discovery of Objects and Affordances

We assume a vision front-end based on an Early Cognitive Vision framework (see [3]) that provides a scene representation composed of local 3D edge descriptors that outline the visible contours of the scene [4]. Because the system lacks knowledge of the objects that make up the scene, this visual world representation is unsegmented: descriptors that belong to one of the objects in the scene are not explicitly distinct from those belonging to another object, or to the background (this is marked by question marks in Figure 1-2). This segmentation problem has been widely addressed in the literature [5], [6], [7]. However, while these segmentation methods are purely vision-based and do not require the agent to interact with the scene, they are unsatisfying for our purpose because they assume certain properties of the objects in order to segment them: e.g., constant color or texture, moving objects, etc. Instead, we will approach the problem from another angle: we will assume that the agent is endowed with a basic reflex action [8] (Figure 1-3) that is elicited directly by specific visual feature combinations in the unsegmented world representation. The outcome of these reflexes will allow the agent to gather further knowledge about the scene. This information will be used to segment the visual world into objects and identify their affordances. We will only consider a single kind of reflex here: the agent tries to grasp any planar surface in the scene. (Other kinds of reflex actions could be devised to enable basic actions other than grasping.) The likely locations of such planar surfaces are inferred from the presence of a coplanar pair of edges in the unsegmented visual world. This type of reflex action is described in [8]. Every time the agent executes such a reflex, haptic information allows the system to evaluate the outcome: either the grasp was successful and the gripper is holding something, or it failed and the gripper closed on thin air.
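The reflex-driven exploration just described can be summarized as a simple control loop. The sketch below is our own illustration, written against a hypothetical vision/arm/gripper API (detect_coplanar_edge_pairs, attempt_grasp, is_holding, and segment_by_arm_motion are placeholder names, not an existing interface).

```python
def explore_scene(vision, arm, gripper):
    """Attempt the grasp reflex on every candidate planar surface and return
    the scene fragments over which physical control was gained (sketch)."""
    grasped_fragments = []
    # Coplanar pairs of 3D edge descriptors suggest likely planar surfaces.
    for surface in vision.detect_coplanar_edge_pairs():
        arm.attempt_grasp(surface)       # elicit the hard-coded grasp reflex
        if gripper.is_holding():         # haptic check: did we close on something?
            # The grasped part of the scene moves synchronously with the arm,
            # so it can be segmented from the rest of the visual world.
            grasped_fragments.append(vision.segment_by_arm_motion(arm))
            gripper.release()
        # A failed grasp falsifies the assumption that a graspable plane
        # was present at this location.
    return grasped_fragments
```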
A failed attempt drives the agent to reconsider its original assumption (the presence of a graspable plane at this location in the scene), whereas a successful attempt confirms the feasibility of this reflex. Moreover, once a successful grasp has been performed, the agent has gained physical control over some part of the scene (i.e., the object grasped, Figure 1-4). If we assume that we know the full kinematics of the robot's arm (which is true for an industrial robot), it is then possible to segment the grasped object from the rest of the visual world, as it is the only part that moves synchronously with the arm of the robot. At this point a new "object" relevant for the higher-level planning model is "born".

Having physical control of an object allows the agent to segment it, to visually inspect it under a variety of viewpoints, and to construct an internal representation of the full 3D shape of the object (see [9]). This shape can then be stored as the description of a newly discovered class A (Figure 1-5) that affords grasp-reflex-A, encoding the initial reflex that "discovered" the object. The object held in the gripper is the first instance a1 of class A. The agent can use its new knowledge of class A to reconsider its interpretation of the scene: using a simple object recognition process (based on the full 3D representation of the class), all other instances of the class in the scene (e.g., in our example a2) are identified and segmented from the unknown visual world. Thus, through a reflex-based exploration of the unknown visual world, object classes can be discovered by the system until it achieves an informed, fully segmented representation of the world, where all objects are instances of symbolic classes and carry basic affordances.

To distinguish the specific successful instances of the robot's reflexes, we will refer to the instance of the reflex that succeeded for a given object as a particular motor program. Note that such motor programs are defined relative to a portion of an object (in our example, the surface that was grasped). We will extend this by assuming that all motor programs can be defined relative to some object. The early cognitive vision system [4], the grasping reflex [8], and the accumulation mechanism [9], which together provide a segmentation of the local feature descriptors into independent objects, currently exist in one integrated system that we will use as a foundation for this architecture.

III. Representing AI Planning Actions

As we have noted, we can also model this robot domain scenario using a formal AI representation. In this case, we will formalize the robot domain using the Linear Dynamic Event Calculus (LDEC) [10], [11], a logical language that combines aspects of the situation calculus with linear and dynamic logics to model dynamically-changing worlds [12], [13], [14]. Our LDEC representation will define the following actions.

Definition 1: High-Level Domain Actions
• grasp(x) – m
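The LDEC axioms for these actions are not reproduced here, but to give a feel for the planning-level view, the sketch below renders a grasp(x) action in a simplified STRIPS-style precondition/effect form. This is a deliberate simplification of LDEC used for illustration only, and the predicate names (graspable, gripper_empty, holding) are our own assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Action:
    """A planning-level action instance (illustrative STRIPS-style sketch)."""
    name: str
    preconditions: FrozenSet[str]  # facts that must hold before execution
    add_effects: FrozenSet[str]    # facts made true by execution
    del_effects: FrozenSet[str]    # facts made false by execution

def grasp(x: str) -> Action:
    """High-level grasp action over an object x discovered by the reflex system."""
    return Action(
        name=f"grasp({x})",
        preconditions=frozenset({f"graspable({x})", "gripper_empty"}),
        add_effects=frozenset({f"holding({x})"}),
        del_effects=frozenset({"gripper_empty"}),
    )
```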

References

[1] J. Malik et al., "Motion segmentation and tracking using normalized cuts," in Proc. Sixth International Conference on Computer Vision (ICCV), 1998.
[2] B. S. Manjunath et al., "Unsupervised segmentation of color-texture regions in images and video," IEEE Trans. Pattern Anal. Mach. Intell., 2001.
[3] G. Aschersleben et al., "The Theory of Event Coding (TEC): A framework for perception and action planning," Behavioral and Brain Sciences, 2001.
[4] T. Plate et al., "Holographic reduced representations: Convolution algebra for compositional distributed representations," in Proc. IJCAI, 1991.
[5] N. Krüger et al., "Accumulation of object representations utilising interaction of robot action and perception," Knowledge-Based Systems, 2002.
[6] F. Wörgötter et al., "Editorial: ECOVISION: Challenges in early-cognitive vision," International Journal of Computer Vision, 2007.
[7] J. McCarthy et al., "Some philosophical problems from the standpoint of artificial intelligence," 1987.
[8] N. Krüger et al., "Accumulation of object representations utilizing interaction of robot action and perception," in Proc. DAGM-Symposium, 2000.
[9] F. Wörgötter et al., "Multi-modal scene reconstruction using perceptual grouping constraints," in Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2006.
[10] G. Palm et al., "Bidirectional retrieval from associative memory," in Proc. NIPS, 1997.
[11] E. P. D. Pednault et al., "ADL: Exploring the middle ground between STRIPS and the situation calculus," in Proc. KR, 1989.
[12] R. Fikes et al., "STRIPS: A new approach to the application of theorem proving to problem solving," in Proc. IJCAI, 1971.
[13] B. Beckert et al., "Dynamic logic," in The KeY Approach, 2007.
[14] M. Kunt et al., "Spatiotemporal segmentation based on region merging," IEEE Trans. Pattern Anal. Mach. Intell., 1998.
[15] M. Steedman et al., "Plans, affordances, and combinatory grammar," 2002.
[16] H. C. Longuet-Higgins, "Non-holographic associative memory," Nature, 1969.
[17] P. Lincoln et al., "Linear logic," SIGACT News, 1992.
[18] K. Steinbuch et al., "Die Lernmatrix," Kybernetik, 2004.
[19] J. A. Anderson et al., "A memory storage model utilizing spatial correlation functions," Kybernetik, 1968.
[20] R. M. Murray et al., A Mathematical Introduction to Robotic Manipulation, 1994.