Overcoming Non-Stationarity in Uncommunicative Learning

Reinforcement learning is a promising technique by which agents can learn to adapt their strategies in multi-agent systems. Most existing reinforcement learning algorithms are designed from a single agent's perspective and, for simplicity, assume the environment is stationary, i.e., that the distribution of the utility of each state-action pair does not change over time. In a more realistic model of multi-agent systems, however, the agents continually adapt their strategies, so a state-action pair yields different utilities at different times. Because of this non-stationarity, multi-agent systems are more sensitive to the trade-off between exploitation, which uses the best strategy found so far, and exploration, which tries to find better strategies. Exploration is especially important in such changing circumstances. In this paper, we assume that the utility of each state-action pair is a stochastic process. This assumption allows us to describe the trade-off dilemma as a Brownian bandit problem and to formalize Sutton's recency-based exploration bonus in non-stationary environments. To demonstrate the performance of the exploration bonus, we build agents using the Q-learning algorithm with smoothed best-response dynamics. Simulations show that these agents can efficiently adapt to changes in their peers' behaviors, whereas the same algorithm with Boltzmann exploration cannot.
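
To make the idea concrete, the following is a minimal sketch (not the paper's exact formalization) of tabular Q-learning that combines a recency-based exploration bonus with smoothed best-response (softmax) action selection. The bonus form kappa * sqrt(elapsed time since the pair was last tried) follows the style of Sutton's Dyna-Q+ bonus; the class name, parameter values, and environment interface are assumptions for illustration only.

```python
import math
import random
from collections import defaultdict


class RecencyBonusQLearner:
    """Tabular Q-learner with a recency-based exploration bonus and
    softmax (smoothed best-response) action selection. Illustrative sketch;
    the paper's exact bonus and dynamics may differ."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, kappa=0.05, tau=0.1):
        self.actions = list(actions)
        self.alpha = alpha            # learning rate (assumed value)
        self.gamma = gamma            # discount factor (assumed value)
        self.kappa = kappa            # exploration-bonus weight (assumed value)
        self.tau = tau                # softmax temperature (assumed value)
        self.q = defaultdict(float)           # Q[(state, action)]
        self.last_tried = defaultdict(int)    # step at which each pair was last tried
        self.t = 0                            # global time step

    def _bonus(self, state, action):
        # Bonus grows with the time elapsed since (state, action) was last tried,
        # pushing the agent to re-explore pairs whose utility may have drifted.
        elapsed = self.t - self.last_tried[(state, action)]
        return self.kappa * math.sqrt(elapsed)

    def select_action(self, state):
        # Smoothed best response: softmax over bonus-augmented Q-values.
        prefs = [(self.q[(state, a)] + self._bonus(state, a)) / self.tau
                 for a in self.actions]
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]   # subtract max for numerical stability
        total = sum(exps)
        r, acc = random.random() * total, 0.0
        for a, e in zip(self.actions, exps):
            acc += e
            if r <= acc:
                return a
        return self.actions[-1]

    def update(self, state, action, reward, next_state):
        # Standard Q-learning backup; the bonus affects only action selection.
        self.t += 1
        self.last_tried[(state, action)] = self.t
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```

Setting kappa to zero recovers plain Boltzmann (softmax) exploration over the raw Q-values, which serves as the baseline that, in the simulations, fails to track changes in the peers' behaviors.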