An RL approach to common-interest continuous action games

In this paper we present a reinforcement learning technique based on Learning Automata (LA), more specifically the Continuous Action Reinforcement Learning Automaton (CARLA) introduced by Howell et al. in [2]. LA are policy iterators which have shown good convergence results in discrete action games with independent learners. The approach presented in this paper allows LA to deal with continuous action spaces. Recently, Rodriguez et al. [3] performed an analysis of the CARLA algorithm, which resulted in an improvement of the CARLA method in terms of computational effort and local convergence properties. The improved automaton performs very well in single-agent problems, but still shows suboptimal performance with respect to global convergence in multi-agent settings.

The CARLA algorithm has been applied successfully to control problems [2, 1]. However, in real-world applications systems can be coupled, with each subsystem controlled by an individual controller. The interaction of these controllers can be considered a common interest game. If the subsystems are controlled while ignoring each other's existence, the interacting dynamics will drive the learners to a suboptimal solution. In such a situation a better exploration of the joint-action space is required. Exploring Selfish Reinforcement Learning (ESRL), introduced in [4], is an exploration method for independent LA playing a repeated discrete action game that guarantees convergence to the optimal Nash equilibrium. The supporting idea of this method is that a set of independent LA will converge to one of the Nash equilibria of the game, but not necessarily one on the Pareto front. ESRL proposes that once the agents converge to a Nash equilibrium, at least two learners delete the selected action from their action spaces and restart learning. This allows the agents to find all dominant equilibria and agree on the best one. Since the more interesting Nash equilibria are often also stronger attractors, the agents can reach Pareto optimal Nash equilibria quite efficiently.

This paper introduces the Exploring Selfish Continuous Action Reinforcement Learning Automaton (ESCARLA), an extension of the ESRL method to continuous action games. The supporting idea of ESRL is to exclude actions after every exploration phase. The problem with applying this approach in continuous action games is that it makes no sense for the agents to delete a single action. Instead, a vicinity around the action should be identified and excluded, so the agent must estimate when it crosses the boundary of the basin of attraction of the local attractor. To solve this problem we propose to use the absolute value of the covariance between the actions and rewards as a metric.
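As an illustration only, the following sketch shows how an individual agent could compute this metric from purely local information: it keeps a sliding window of its own actions and observed rewards and reports the absolute sample covariance. The class name, window length and interface are choices made for this example and are not part of the CARLA or ESCARLA specification.

```python
# Illustrative sketch only (not the authors' implementation): an independent
# learner monitors |cov(action, reward)| over a sliding window of its own
# recent experience. Window length and names are assumptions of this example.
from collections import deque

import numpy as np


class CovarianceMonitor:
    def __init__(self, window=500):
        self.actions = deque(maxlen=window)
        self.rewards = deque(maxlen=window)

    def update(self, action, reward):
        # Store the latest (action, reward) pair observed by this agent.
        self.actions.append(action)
        self.rewards.append(reward)

    def abs_covariance(self):
        # A covariance estimate needs at least two samples.
        if len(self.actions) < 2:
            return 0.0
        cov_matrix = np.cov(np.array(self.actions), np.array(self.rewards))
        return abs(cov_matrix[0, 1])
```

In a learning loop the agent would call update once per time-step and track a smoothed version of abs_covariance, mirroring the averaged covariance curve discussed below.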
Figure 1 shows the contour representation of an example 2-player game with three attractors. The two local maxima, located in the top-left and bottom-right corners, have larger basins of attraction, while the global maximum at the center has a narrower basin of attraction. Figure 2 shows the relation between the exploration and the covariance between actions and rewards from the point of view of a single agent. The first row shows a global view of the exploration over three time intervals. The first interval is the start of the learning process (time-steps 0 to 1000). The second interval is when the learners are reducing the global exploration (time-steps 2000 to 3000); notice this is a good time for deciding which neighborhood to exclude. The last interval is when the agents have converged to the local attractor (time-steps 9000 to 10000). The second row shows the local information that the independent agents can access. The same time-steps are represented in each column, but here the selected actions are plotted on the horizontal axis and the corresponding rewards on the vertical axis. The bottom row shows the absolute value of the covariance between actions and rewards over the whole learning process; to give a better idea of how this covariance evolves, the solid curve represents its average. The time-steps corresponding to the three intervals introduced above are shaded in gray.

The covariance is low at the beginning of learning, since both agents perform a lot of exploration. When the agents are exploring within the basin of attraction of a local attractor, the noise in the rewards observed by each agent is minimal, so the covariance reaches its maximum. As the agents converge to a locally superior strategy, less exploration is performed and the covariance drops back to zero. The safe region to exclude after the agent's actions have converged to the local optimum can therefore be estimated at the moment when the absolute value of the covariance reaches its maximum. A good way of estimating this region is using the percentiles of the probability density function of the actions. For a given confidence value c we can define a region as shown in expression (1), where percentile(p, f) represents the value at which the probability density function f accumulates probability p.
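To make this construction concrete, the sketch below computes such a region from a set of action samples gathered around the covariance peak. It assumes a symmetric region between percentile((1-c)/2, f) and percentile((1+c)/2, f); this is only one plausible reading of expression (1), which is not reproduced here, and the function and variable names are illustrative.

```python
# Illustrative sketch only: derive the region to exclude from percentiles of
# the action distribution observed around the covariance peak. The symmetric
# split [(1-c)/2, (1+c)/2] is an assumption of this example and is not
# necessarily the exact form of expression (1).
import numpy as np


def exclusion_region(action_samples, confidence):
    """Return (lower, upper) bounds of the neighborhood to exclude."""
    lower = np.percentile(action_samples, 100.0 * (1.0 - confidence) / 2.0)
    upper = np.percentile(action_samples, 100.0 * (1.0 + confidence) / 2.0)
    return lower, upper


# Example usage: actions sampled while the agent explored near a local
# optimum; with c = 0.95 the central 95% of their mass is excluded.
samples = np.random.normal(loc=0.5, scale=0.05, size=1000)
print(exclusion_region(samples, confidence=0.95))
```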