An Artificial High-Level Vision Agent for the Interpretation of the Operations of a Robotic Arm

We describe an artificial high-level vision agent for the symbolic and graphic interpretation of data coming from a video camera that acquires image sequences of the SPIDER robot arm of the EUROPA system during its operations. The agent generates perceptually grounded predicates from the image sequences and provides a 3D estimation of the arm movements, thus giving the scientist who uses SPIDER meaningful feedback on the operations performed with the arm during a scientific experiment.

1. INTRODUCTION

We describe an artificial high-level vision agent for the interpretation of data coming from a video camera that acquires image sequences of the SPIDER robot arm of the EUROPA system [7,11,15] during its operations (see Fig. 1). The software module described here concerns the interpretation of sensory data in the framework of an ASI project that applies AI techniques to the design and realization of an effective and flexible system for the supervision of the SPIDER arm. The arm will work on board the International Space Station (ISS). The framework project is a three-year Italian research project [1,6,17] sponsored by the Italian Space Agency (ASI) and involving AI researchers from the Universities of Rome, Turin, Genoa, Palermo and Parma, from the IP-CNR of Rome and from the IRST of Trento.

The main aim of the vision agent is to advance the state of the art in artificial vision for space robotics by introducing and integrating artificial vision techniques that give the SPIDER arm operations greater degrees of autonomy [2,3].

Fig. 1. The SPIDER arm of the EUROPA system.
The main capabilities of the vision agent are: to locate and segment the SPIDER arm even against contrasted and irregular backgrounds; to perform a 3D estimation of the position of the arm from camera images; to interpret complex movements of the arm acquired by a camera in terms of symbolic descriptions. The implemented computer vision agent is based on three main components: (i) the perception component; (ii) the scene description component; (iii) the visualization component.

Proc. Fifth International Symposium on Artificial Intelligence, Robotics and Automation in Space, 1-3 June 1999 (ESA SP-440)

In the following, Sect. 2 describes the perception component of the system, i.e., how the system performs the low-level image processing needed to locate and segment the SPIDER arm. Sect. 3 describes the scene description component, in which the acquired image is interpreted both in terms of 3D parameters and in terms of generated symbolic assertions. Sect. 4 describes the visualization component, through which the user may interact with the agent components, and Sect. 5 describes the implementation details of the system. Finally, Sect. 6 outlines some conclusions and future developments.

2. THE PERCEPTION COMPONENT

The perception component of the agent processes the image data coming from a video camera that acquires the operations of the SPIDER arm. The main task of this component is to estimate the positions of the arm in the acquired image. It should be noted that this estimation, which is generated solely from the visual data, may also be useful for identifying faults in the position sensors placed on the joints of the arm. The images acquired by the camera are processed by the contour module, which extracts the arm contours by a suitable algorithm based on snakes [5,9,12]. The snake is a deformable curve that moves in the image under the influence of forces related to the local distribution of the gray levels.
When the snake reaches an object contour, it adapts to the shape of that contour. In this way it is possible to extract the shape of the object in the image. The snake, as an open or closed contour, is described in parametric form by:

\[ v(s) = (x(s), y(s)) \]

where $x(s)$, $y(s)$ are the x, y coordinates along the contour and $s$ is the normalized arc length:

\[ s \in [0, 1] \]

The snake model adopted is based on circles and squares, in order to better extract the arm components (see Fig. 2). The snake model defines the energy of a contour, named the snake energy, $E_{snake}$, to be:

\[ E_{snake} = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) \right] ds \]

The energy integral is a functional, since its independent variable is a function. The internal energy, $E_{int}$, is formed from a Tikhonov stabilizer and is defined as:

\[ E_{int}(v(s)) = a(s) \left\| \frac{dv(s)}{ds} \right\|^2 + b(s) \left\| \frac{d^2 v(s)}{ds^2} \right\|^2 \]

where $\| \cdot \|$ is the Euclidean norm. The first-order continuity term, weighted by $a(s)$, makes the contour behave elastically, whilst the second-order curvature term, weighted by $b(s)$, makes it resistant to bending. For example, setting $b(s) = 0$ at a point allows the snake to become second-order discontinuous at that point and develop a corner. The image functional determines which features have a low image energy and hence which features attract the contour. In general this functional is made up of three terms:

\[ E_{image} = w_{line} E_{line} + w_{edge} E_{edge} + w_{term} E_{term} \]

where each $w$ denotes a weighting constant, and the three pairs of $w$ and $E$ correspond to lines, edges and terminations respectively. The snake used in this framework has only the edge functional, which attracts the snake to points of high gradient:

\[ E_{edge} = -\left( G_\sigma * \nabla^2 I(x, y) \right)^2 \]

Fig. 2. Contour module extraction by the snake technique.

This is the image functional proposed by Kass [12]. It is a scale-based edge operator that increases the locus of attraction of the energy minimum. $G_\sigma$ is a Gaussian of standard deviation $\sigma$, which controls the smoothing applied prior to the edge operator. Minima of $E_{edge}$ lie on zero-crossings of $G_\sigma * \nabla^2 I(x, y)$, which define edges in the Marr-Hildreth [9,10] theory.
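As a rough illustration of the energy terms described above, the following Python sketch evaluates a discretized snake energy for a closed polygonal contour. The function names, the constant weights `a` and `b`, the finite-difference scheme and the `edge_response` callable (a stand-in for the smoothed Laplacian of the image at a point, with the edge term taken as its negative square) are assumptions made for the example, not details of the implemented system.

```python
import math


def internal_energy(points, a=1.0, b=1.0):
    """Discrete internal energy of a closed snake.

    points: list of (x, y) vertices sampled along the contour.
    a, b: constant weights standing in for a(s) and b(s) (assumption).
    """
    n = len(points)
    h = 1.0 / n  # spacing of the normalized parameter s in [0, 1]
    total = 0.0
    for i in range(n):
        x_prev, y_prev = points[i - 1]
        x_cur, y_cur = points[i]
        x_next, y_next = points[(i + 1) % n]
        # First derivative (forward difference): elastic term.
        dx = (x_next - x_cur) / h
        dy = (y_next - y_cur) / h
        # Second derivative (central difference): bending term.
        ddx = (x_next - 2 * x_cur + x_prev) / h ** 2
        ddy = (y_next - 2 * y_cur + y_prev) / h ** 2
        total += (a * (dx * dx + dy * dy) + b * (ddx * ddx + ddy * ddy)) * h
    return total


def snake_energy(points, edge_response, a=1.0, b=1.0):
    """Internal energy plus an edge-only image energy.

    edge_response(x, y) is a hypothetical callable returning the filtered
    image value at (x, y); the edge term is its negative square.
    """
    h = 1.0 / len(points)
    e_image = sum(-(edge_response(x, y) ** 2) * h for x, y in points)
    return internal_energy(points, a, b) + e_image


def circle(radius, n=64):
    """Sample n vertices on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]
```

With `b = 0` only the elastic term remains, so the contour is no longer penalized for bending, which is the situation described in the remark on corners. A contour lying where `edge_response` is large in magnitude has a lower total energy than the same contour placed on a flat region.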
Scale space filtering is employed: the snake first comes into equilibrium on a heavily filtered image, and the level of filtering is then reduced, increasing the locus of attraction of a minimum. The implemented snake extracts the arm shape simply and in a short time. Fig. 2 shows the results of the contour module.

From the extracted arm snake it is possible to estimate the position of the links of the arm in the image plane, i.e., without the depth information, which is recovered by the scene description component. Let us consider a generic link $i$ of the arm at time $t$; the link is characterized by its 3D coordinates:

\[ \mathbf{x}_i(t) = [x_i(t), y_i(t), z_i(t)] \]

A generic posture of the SPIDER arm at time $t$ is characterized by the vector $\mathbf{x}(t)$, which identifies the seven links of the arm:

\[ \mathbf{x}(t) = [\mathbf{x}_1(t), \mathbf{x}_2(t), \ldots, \mathbf{x}_7(t)] \]

The snake information allows us to estimate the first two coordinates of each link, i.e., their projection in the image plane:

\[ \mathbf{x}'_i(t) = [x_i(t), y_i(t)] \]

3. THE SCENE DESCRIPTION COMPONENT

The scene description component receives as input the data coming from the perception component and generates a symbolic description of the arm operations. This component is based on a self-organizing neural network with a suitable explicit representation of time sequences, the ARSOM [4,14]. Each unit of the ARSOM is an autoregressive (AR) filter, able to classify and recognize variable inputs. The map self-organizes during an unsupervised learning phase. Each unit of the map characterizes a sequence of movements of the SPIDER arm. Let us consider a generic movement of the SPIDER arm. The movement is characterized by a sequence of $n$ postures:

\[ \mathbf{x}(t), \mathbf{x}(t-1), \ldots, \mathbf{x}(t-(n-1)) \]

The AR model associated with this movement is:

\[ \mathbf{x}(t+1) = A_0 \mathbf{x}(t) + A_1 \mathbf{x}(t-1) + \cdots + A_{n-1} \mathbf{x}(t-(n-1)) + e(t) \]

The order of the model is $n$, the matrices $A_0, A_1, \ldots, A_{n-1}$ are the weights of the model, and $e(t)$ is the error matrix. Let us denote by $B$ the global matrix related to the weight matrices:

\[ B = [A_0, A_1, \ldots, A_{n-1}]^T \]

and by $X(t)$ the global matrix related to the postures. We may write the previous equation in a more compact form:

\[ \mathbf{x}(t+1) = X^T(t) B + e(t) \]

The optimal weight matrices are found by minimizing the error matrix $e(t)$. We have adopted an LMS iterative method, in which the update of each unit is scaled by the neighborhood kernel $h_{ci}$; the kernel depends on a suitable parameter $r$ and on the learning window $N_c$.

Fig. 3. Error diagram vs training epochs.

The neural network, after a careful training phase, is able to classify the temporal sequences of movements of the arm into meaningful prototypical predicates. Fig. 3 shows the error of the neural network during the training phase. It should be noted that, after a few hundred learning steps, the error of the network is close to zero. When the estimates of the coordinates of the links in the image plane are presented to the network:

\[ \mathbf{x}'(t), \mathbf{x}'(t-1), \ldots, \mathbf{x}'(t-(n-1)) \]

the network is able to predict the full vector $\mathbf{x}(t+1)$, i.e., the vector with all three coordinates of the posture of the arm links.

Fig. 4. Prediction error of the network.

Fig. 4 shows the prediction error of the network during its operations. It should be noted that the error, although variable, stays within reasonable limits. Furthermore, the network is also able to classify the global arm movement and to output a symbolic predicate describing the movement itself. Examples of the learned predicates describing the operations of the arm are: Stretching-up, Stretching-down, Seizing, Grasping. The main advantage of the neural network approach is that it avoids an explicit description of the discrimination functions for the arm operations, as these functions are learned during the training phase. Furthermore, the neural network is robust with respect to noise, as it is able to classify the arm operations correctly even when the movement estimates of some links are missing or corrupted.
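The AR fitting and the recognition step can be illustrated on a toy problem. The Python sketch below fits one AR filter per movement with a plain LMS rule and assigns a new sequence to the filter with the smallest prediction error. It uses scalar samples instead of full posture vectors and omits the map neighborhood; the selection by minimum prediction error, the learning rate and the sinusoidal test sequences are assumptions made for illustration only.

```python
import math


def ar_predict(weights, history):
    """Predict x(t+1) from history = [x(t), x(t-1), ..., x(t-(n-1))]."""
    return sum(w * x for w, x in zip(weights, history))


def lms_fit(series, order, rate=0.05, epochs=200):
    """Fit scalar AR weights with the LMS rule w <- w + rate * e(t) * x."""
    weights = [0.0] * order
    for _ in range(epochs):
        for t in range(order - 1, len(series) - 1):
            history = [series[t - k] for k in range(order)]
            error = series[t + 1] - ar_predict(weights, history)
            weights = [w + rate * error * x for w, x in zip(weights, history)]
    return weights


def prediction_error(weights, series):
    """Mean squared one-step prediction error of an AR filter on a series."""
    order = len(weights)
    errors = [
        (series[t + 1]
         - ar_predict(weights, [series[t - k] for k in range(order)])) ** 2
        for t in range(order - 1, len(series) - 1)
    ]
    return sum(errors) / len(errors)


def best_unit(units, series):
    """Index of the AR unit that predicts the series best (assumed rule)."""
    return min(range(len(units)),
               key=lambda i: prediction_error(units[i], series))


# Hypothetical stand-ins for two movement classes: a slow and a fast oscillation.
slow = [math.sin(0.3 * t) for t in range(200)]
fast = [math.sin(1.0 * t) for t in range(200)]
units = [lms_fit(slow, order=2), lms_fit(fast, order=2)]
```

Calling `best_unit(units, slow)` selects the filter trained on the slow sequence, since a sinusoid obeys an exact second-order recurrence that the other filter does not match. In the full model the same idea applies to matrices of weights and vectors of link coordinates.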
In the operational tests performed, the network achieved 100% success on the classification task. To analyze the operation of the network further, tests were performed on the recognition task when the information for some links is missing. Table 1 reports the results obtained. It should be noted that in the worst case, when links 1 and 3 are both missing, the network still achieves a 51% recognition rate.

Table 1. Recognition % with respect to the missing links (columns: Missing links; Recognition %).

4. THE VISUALIZ