A perspective on off-policy evaluation in reinforcement learning

The goal of reinforcement learning (RL) is to build an autonomous agent that takes a sequence of actions to maximize a utility function by interacting with an external, unknown environment. It is a very general learning paradigm that can model a wide range of problems, such as games, robotics, autonomous driving, human-computer interaction, recommendation, healthcare, and many others. In recent years, powered by advances in deep learning and computing power, RL has seen great successes, with AlphaGo/AlphaZero as a prominent example. Such impressive outcomes have sparked fast-growing interest in using RL to solve real-life problems. In this article, I will argue that we must address the evaluation problem before RL can be widely adopted in real-life applications.

In RL, the quality of a policy is often measured by the average reward the agent receives when it follows the policy to select actions. If the environment can be simulated, as in computer games, evaluation can be done simply by running the policy. However, for most real-life problems, such as autonomous driving and medical treatment, running a new policy in the actual environment can be expensive, risky, and/or unethical. Creating a simulated environment for policy evaluation is common practice, but building a high-fidelity simulator is often harder than finding an optimal policy itself (consider building a simulated patient that covers all possible medical conditions). RL practitioners therefore find themselves stuck in a painful dilemma: to deploy a new policy, they must show it is of sufficient quality, but the only reliable way to do so appears to be deploying the policy!
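To make this quality criterion concrete, a minimal sketch of one standard formalization (assuming an episodic, discounted setting with discount factor $\gamma \in [0,1]$, which is not spelled out above) writes the value of a policy $\pi$ as
\[
v(\pi) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{T} \gamma^{t} r_{t} \;\middle|\; a_t \sim \pi(\cdot \mid s_t) \,\right],
\]
the expected cumulative (discounted) reward obtained when actions are drawn from $\pi$; the average-reward criterion referred to above is a closely related variant of this quantity.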