This paper investigates an on-line backpropagation learning system for a mobile robot that learns to follow a human. The scenario is part of a project to investigate Human-Robot Interaction within a social context. Because the environment is totally unknown to the system, training data have to be generated during operation, for which a training data selection method is proposed. Two types of learning take place simultaneously in the system: adaptive learning, which learns slowly, and reactive learning, which learns fast. These satisfy the system's requirements for long-term adaptation and short-term reactivity respectively. The learning happens on-line and can adapt rapidly to the unknown environment.

1 Context of Research

This paper presents research that is part of a project investigating autonomous learning of appropriate social distance by a mobile robot. The project aims to develop a robot that fulfils the principles of Human-Robot Interaction (HRI) in a social context, with an emphasis on the maintenance of social space. Hall (1966) states that social space is an important element of any form of social organization and that the control of distance between agents is an important means of communication. A scenario proposed in the project is human-following by a mobile robot. This paper discusses an on-line learning method for this scenario.

HRI has become a very active research topic. Adaptive systems, especially autonomous learning systems, have been widely used because of the unpredictable dynamics in human-related interactions. One approach adopted by many researchers is to analyse human behaviour from experiments in which people interact with one another or with a robot. By analysing the collected data, a mathematical model can be built and an adaptive system can be constructed that allows the robot to interact satisfactorily with humans (e.g. Wood, 2006). Another commonly used method is to build the robot to interact in a creature-like way and to base its behaviour on existing, well-studied physical and biological models (e.g. Arkin, 2003), leaving some key parameters of the model to be adjusted by an adaptive system. Both methods have introduced social learning concepts and produced fruitful results. The control of social space has been studied in both approaches (e.g. Mitsunaga et al., 2005; Nakauchi, 2002).

However, we are trying to take a less studied perspective. Social behaviours are outcomes of social interaction, the foundation of which is responding to the attitude of the other (Ashworth, 1979). People behave based on their perception of others. Thus an attractive idea is to have the robot learn the behaviours of interaction according to the attitude of the human, without constraining the model to any specific scenario. As long as the human's attitude is associated with the interaction, the robot will learn to interact in an appropriate way. In this sense, the attitude of the human will be a general reflection of their satisfaction with the interaction in which they are engaged. The human-following scenario serves as the test bed of the simulation, and the attitude of the human is designed to be associated with the human-following behaviour. The sensory system of the robot must measure its position relative to the person, who in turn is able to input feedback reflecting his or her level of satisfaction, i.e. the attitude, with the robot's current position.
This sensor will contain a digital input device which the human will hold and use to provide feedback at any time. Similar devices for use in HRI have been studied in previous research (e.g. Koay et al., 2005). The device will be connected to the robot mechanically so that the position of the robot relative to the person can be measured. This sensory system is simulated in the work reported here. The primary obstacle in this system is that the attitude of the human is not, and cannot be directly transformed into, an error measurement for the robot system. The study focuses on autonomous learning under this constraint and leaves other dynamics to further research within the project.

2 Overview of the Algorithm

Artificial neural networks (ANNs) form one of the most widely used autonomous learning methods, and error backpropagation (BP) learning has been well studied in the literature (e.g. Stroeve, 1998). As noted above, in the proposed scenario no direct error measurements exist, only an arbitrary performance reward score given by the human to denote their satisfaction with the interaction. Such a problem appears to fit the category of policy learning and has been studied with reinforcement learning algorithms (Schaal, 1997). However, compared to previous studies, a further problem we face is that the reward score has limited gradients and is discrete, while our scenario needs good generalisation, which is hard for a conventional reinforcement network. Thus multi-layered feed-forward (MLFF) ANNs with BP learning became our focus.

A training data selection method is used to generate and optimize training data during simulations. The reward score given to the robot's performance can thus control the learning of the system. On-line BP learning needs to review old data while it learns new patterns (Patterson, 1996). Adaptive learning is introduced to act as a reviewer in the system. This is a long-term learning process that gradually optimizes and generalizes the system performance through training, and it is accompanied by a fast reactive learning procedure. The reactive learning enables the robot to respond to the attitude of the human quickly by minimising the error on the most recently collected data. The system consists of two small MLFF networks, together with a set of matrices used to select training data. The performance of the system has been tested in simulation and appears to be robust.

The remainder of this paper is organised as follows. Section 3 specifies the model of the people-following problem. Section 4 explains the structure and the learning algorithm of the system. Section 5 discusses the training procedures and demonstrates the simulation results, and Section 6 evaluates the capabilities of the system. Finally, there is a short conclusion.

3 Model Specification

A socially acceptable human-following robot must maintain the distance and position of which the target human most approves. Our focus in this study is to adapt the robot's dynamics to reflect an unknown human preference about being followed. In keeping with the sensory system described above, the human in the simulation is simplified to a reward function of the robot's position relative to the human. This function provides a single numerical score indicating the human's satisfaction. It represents a surface that relates position to reward score, and its peak corresponds to the most appropriate location. The robot is initialized knowing neither the reward surface nor the consequences of its movements.
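To make this concrete, the following is a minimal Python sketch of such a reward function, written against the description above and the surface plotted in Fig. 1 (Section 3 below). The preferred offset, the spread of the graded region, and the score assigned to the flat surface are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Hypothetical stand-in for the human's reward surface. The paper specifies
# only that the surface has a single peak at the preferred position, scores
# in [0, 1], and a 'flat surface' of lowest satisfaction outside a limited
# graded region; the offset, spread, and floor values here are assumptions.
PREFERRED_OFFSET = np.array([-600.0, -800.0])  # e.g. behind and to the left, in mm
SPREAD = 400.0                                 # width of the graded region, in mm
FLOOR = 0.1                                    # score returned on the flat surface

def reward(robot_pos):
    """Reward score in [0, 1] for the robot's position relative to the human."""
    d2 = float(np.sum((np.asarray(robot_pos) - PREFERRED_OFFSET) ** 2))
    score = np.exp(-d2 / (2.0 * SPREAD ** 2))
    return max(score, FLOOR)   # no gradient once the robot is far from the peak
```

Any function with a single peak at the preferred position and a constant floor elsewhere would serve equally well for the simulation described here.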
Therefore it is essential that the robot learn appropriate actions that take it to the best position relative to the human's preference, i.e. the reward surface of the target human. An assumption is made that the human always maintains a constant velocity, which the robot acquires instantly as soon as it starts to follow. This allows the human to be treated as stationary by subtracting the constant velocity from the system. Another simplification is that all collisions are ignored (e.g. between the robot and the human). An example reward surface is shown in Fig. 1.

Fig. 1. Contour plot of a reward surface: the contour lines illustrate the surface of the reward score that the human can provide as a function of the robot's position. The values of the reward score are marked on the contour lines. The 'flat surface' marks the region without gradient.

Fig. 1 is a map measuring 3000 mm by 3000 mm. The target person shows a preference in both the angle and the distance to the follower; in this case the person would like to be followed from behind on the left. The reward score lies in the range [0, 1]: the higher the score, the higher the human's satisfaction. The gradient of the reward exists only in a limited region. The area outside this region maintains the lowest degree of satisfaction and is marked as the 'flat surface', which occupies most of the map. The surface is designed in this manner because it would be unrealistic for a human to provide smooth, continuous feedback over the whole space.

We assume that the person has an (initially unknown) reward surface and that the robot's objective is to find a sequence of movements that will take it to the position with the highest reward value. The system works in discrete time steps, where the interval between any two adjacent time indices is taken to be one second. A further assumption is made that the robot accomplishes any assigned movement instantly. The input to the system is the position of the robot and the outputs are the movements along the x and y axes in the next second, both of which lie in the range [-50, 50] mm. These assumptions were made only for simplicity of simulation and can be relaxed easily without major changes to the system configuration.

4 MLFF Network with BP Learning

The system contains two MLFF networks, one for each output, the outputs being the movements in the x and y directions respectively. Separating the outputs into independent networks reduces the complexity of the system, because the mapping dimensions handled by each network are smaller than for an integrated one. The BP rule minimises output error, and training data are the standard means of measuring such error. But this model provides no prior training data or any on-line information that can be used for training directly. Therefore a method of composing training data is required.

4.1 Training Data Selection

MLFF BP learning requires an error value to be associated with each output. Such error cannot be measured directly.
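As an illustration of the structure just described, the following is a minimal sketch of two small one-hidden-layer MLFF networks trained by plain gradient-descent BP, one per output axis, together with an update routine that performs a fast reactive pass over the most recently selected samples and a slow adaptive 'review' pass over older stored data, as outlined in Section 2. The class and function names, layer sizes, learning rates, and data handling are illustrative assumptions; the paper's own training data selection method, which composes the (position, movement) targets from the reward scores, is not reproduced here.

```python
import numpy as np

class SmallMLFF:
    """Minimal one-hidden-layer feed-forward network trained by plain BP.
    The layer size and initialisation are illustrative; the paper states only
    that two small MLFF networks are used, one per output axis."""

    def __init__(self, n_in=2, n_hid=6, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hid, n_in))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.normal(0.0, 0.5, (1, n_hid))
        self.b2 = np.zeros(1)

    def forward(self, x):
        self.h = np.tanh(self.W1 @ x + self.b1)          # hidden activations
        return (self.W2 @ self.h + self.b2).item()        # movement along one axis (mm)

    def bp_step(self, x, target, lr):
        """One gradient-descent step on the squared output error."""
        err = self.forward(x) - target
        d_out = err * self.h[np.newaxis, :]                    # dE/dW2
        d_hid = (self.W2.ravel() * err) * (1.0 - self.h ** 2)  # error backpropagated to hidden layer
        self.W2 -= lr * d_out
        self.b2 -= lr * err
        self.W1 -= lr * np.outer(d_hid, x)
        self.b1 -= lr * d_hid
        return err ** 2


def online_update(net_x, net_y, recent, reviewed, lr_fast=0.05, lr_slow=0.005):
    """Reactive pass over the newest samples, then an adaptive 'review' pass
    over older stored samples so new patterns do not overwrite old ones."""
    for pos, move in recent:          # fast learning: short-term reactivity
        net_x.bp_step(pos, move[0], lr_fast)
        net_y.bp_step(pos, move[1], lr_fast)
    for pos, move in reviewed:        # slow learning: long-term adaptation
        net_x.bp_step(pos, move[0], lr_slow)
        net_y.bp_step(pos, move[1], lr_slow)
```

In use, `recent` would hold the training pairs composed at the current step and `reviewed` a subset of the stored training matrices, so that the fast reactive learning responds to the latest human feedback while the slow adaptive learning preserves what was learned earlier.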
References

[1] Sybert H. Stroeve, et al. An analysis of learning control by backpropagation through time. Neural Networks, 1998.
[2] Yasushi Nakauchi, et al. A Social Robot that Stands in Line. Autonomous Robots, 2002.
[3] Masahiro Fujita, et al. An ethological and emotional basis for human-robot interaction. Robotics and Autonomous Systems, 2003.
[4] K. Dautenhahn, et al. Comparing human robot interaction scenarios using live and video based methods: towards a novel methodological approach. 9th IEEE International Workshop on Advanced Motion Control, 2006.
[5] Takayuki Kanda, et al. Robot behavior adaptation for human-robot interaction based on policy gradient reinforcement learning. 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2005.
[6] Peter D. Ashworth. Social Interaction and Consciousness. 1979.
[7] Martin T. Hagan, et al. Gauss-Newton approximation to Bayesian learning. Proceedings of the International Conference on Neural Networks (ICNN'97), 1997.
[8] Kerstin Dautenhahn, et al. Methodological issues using a comfort level device in human-robot interactions. ROMAN 2005: IEEE International Workshop on Robot and Human Interactive Communication, 2005.
[9] E. Hall, et al. The Hidden Dimension. 1970.