Knowledge and ignorance in reinforcement learning
The field of Reinforcement Learning is concerned with teaching agents to make optimal decisions that maximize their total utility in complex environments. Unlike other machine learning settings, a Reinforcement Learning problem, generally described by the Markov Decision Process formalism, has several complex interacting components. I distinguish three: the state-space/transition model, the reward function, and the observation model. In this thesis, I present a framework for studying how the state of knowledge or uncertainty about each component affects the Reinforcement Learning process. I focus on the reward function and the observation model, which have traditionally received little attention. Algorithms for learning good policies when these components are completely specified are well understood, but it is less clear what to do when they are unknown, uncertain, or irrelevant. In this thesis, I describe how to adapt Reinforcement Learning algorithms to cope with these situations.
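To make the decomposition concrete, the sketch below groups the three components of a finite decision process. The names and types are purely illustrative, not code from the thesis.

    from dataclasses import dataclass
    from typing import Dict, Optional, Tuple

    State, Action, Obs = int, int, int

    @dataclass
    class DecisionProcess:
        # State-space/transition model: P(s' | s, a)
        transitions: Dict[Tuple[State, Action], Dict[State, float]]
        # Reward function: R(s, a)
        rewards: Dict[Tuple[State, Action], float]
        # Observation model: P(o | s); None for a fully observed MDP
        observations: Optional[Dict[State, Dict[Obs, float]]] = None
        gamma: float = 0.95

Each of the three chapters of results can be read as asking what happens when one of these fields is only partially known.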
Recently there has been great interest in the Inverse Reinforcement Learning problem, in which the objective is to learn the reward function from evidence of an agent's reward-maximizing policy. The usual goal is apprenticeship learning, where the agent learns the optimal actions to perform from an expert. However, the reward function is sometimes of independent interest as well. I describe a Bayesian Inverse Reinforcement Learning (BIRL) approach to this problem. BIRL uses a generative model to describe the expert's decision-making process; by inverting it, we can infer the reward function from observed actions. It is distinguished from other IRL approaches by its emphasis on the accuracy of the reward function itself, not just as an intermediate step toward apprenticeship learning. BIRL is also able to handle incomplete and contradictory information from the expert. It has been applied successfully to preference elicitation problems in computer games and robot manipulation. In a recent comparison of IRL approaches, BIRL was the best-performing general IRL algorithm.
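The sketch below illustrates the kind of generative-model inversion described above: expert actions are modeled as Boltzmann-distributed in the optimal Q-values for a candidate reward, and the reward posterior is explored with a random-walk Metropolis sampler. The prior, proposal, and solver interface here are assumptions for illustration, not the thesis implementation.

    import numpy as np

    def log_likelihood(demos, q_star, alpha=1.0):
        """demos: list of (state, action) pairs; q_star: |S| x |A| optimal Q-values."""
        logits = alpha * q_star
        log_pi = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
        return sum(log_pi[s, a] for s, a in demos)

    def birl_mcmc(demos, solve_q, n_states, n_iters=1000, step=0.1, rng=None):
        """solve_q(reward) -> optimal Q-values for that reward (MDP solver elided)."""
        rng = np.random.default_rng() if rng is None else rng
        reward = np.zeros(n_states)
        log_p = log_likelihood(demos, solve_q(reward))
        samples = []
        for _ in range(n_iters):
            proposal = reward + step * rng.standard_normal(n_states)
            log_p_new = log_likelihood(demos, solve_q(proposal))
            if np.log(rng.random()) < log_p_new - log_p:  # uniform prior assumed
                reward, log_p = proposal, log_p_new
            samples.append(reward.copy())
        return samples

The posterior samples can then be summarized (e.g., by their mean) to estimate the reward, or used directly to quantify uncertainty about it.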
I also extend this model to a related task, Reward Shaping. In reward shaping, we adjust a known reward function so that the learning agent converges on the optimal policy as quickly as possible. Reward shaping has been proposed and studied in many applications, typically using additive potential functions. However, the requirement of absolute policy-invariance is too strict to admit many useful cases of shaping. I define Bayesian Reward Shaping, a generalization to a soft form of reward shaping, and provide algorithms for achieving it.
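For reference, a minimal sketch of the standard additive (potential-based) shaping mentioned above is shown here: the bonus F(s, a, s') = gamma * Phi(s') - Phi(s) preserves the optimal policy. Bayesian Reward Shaping relaxes this strict policy-invariance; the potential function below is an illustrative assumption.

    def shaped_reward(r, s, s_next, phi, gamma=0.95):
        """Augment the environment reward r with a potential-based bonus."""
        return r + gamma * phi(s_next) - phi(s)

    # Example: a distance-to-goal potential on a 1-D grid with goal state 10.
    phi = lambda s: -abs(10 - s)
    print(shaped_reward(0.0, s=3, s_next=4, phi=phi))  # positive bonus for moving toward the goal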
The impact of observation models on reinforcement learning has been studied even less than that of reward functions. This is surprising, considering how adding partial observability to an MDP model blows up the computational complexity; a better understanding of the trade-offs between representational accuracy and efficiency would therefore be helpful. In this work, I describe how, in certain cases, POMDPs can be approximated by MDPs or slightly more complicated models with bounded performance loss. I also present an algorithm, Smoothed Sarsa, for learning policies when the observation model is uncertain. Smoothed Sarsa is based on the observation that in many real-world POMDPs better state estimates become available at later time steps, so delaying the backup step of a temporal-difference algorithm can sidestep the uncertainty in the observation model and better approximate the underlying MDP.
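One possible reading of the delayed-backup idea is sketched below: the TD update for time t is postponed for a few steps so that a smoothed, retrospective state estimate can replace the noisy on-line one. The buffering scheme, state-estimation routine, and tabular representation are assumptions for illustration, not the thesis implementation.

    from collections import deque

    def smoothed_sarsa_update(Q, buffer, estimate_state, alpha=0.1, gamma=0.95, delay=2):
        """buffer: deque of (observation_history, action, reward) tuples; the oldest
        entry is backed up only once `delay` newer steps are available."""
        if len(buffer) <= delay:
            return
        hist_t, a_t, r_t = buffer.popleft()
        hist_next, a_next, _ = buffer[0]
        s_t = estimate_state(hist_t)      # smoothed estimate that can use later evidence
        s_next = estimate_state(hist_next)
        td_target = r_t + gamma * Q[s_next][a_next]
        Q[s_t][a_t] += alpha * (td_target - Q[s_t][a_t])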
By combining these approaches, Bayesian Reward Shaping and Smoothed Sarsa, I trained a mobile robot to execute delivery tasks in an office environment.