A perspective on off-policy evaluation in reinforcement learning

The goal of reinforcement learning (RL) is to build an autonomous agent that takes a sequence of actions to maximize a utility function by interacting with an external, unknown environment. It is a very general learning paradigm that can model a wide range of problems, such as games, robotics, autonomous driving, human-computer interaction, recommendation, healthcare, and many others. In recent years, powered by advances in deep learning and computing power, RL has seen great successes, with AlphaGo/AlphaZero as a prominent example. Such impressive outcomes have sparked fast-growing interest in using RL to solve real-life problems. In this article, I will argue that we must address the evaluation problem before RL can be widely adopted in real-life applications.

In RL, the quality of a policy is often measured by the average reward the agent receives when it follows the policy to select actions. If the environment can be simulated, as in computer games, evaluation can be done simply by running the policy. However, for most real-life problems, such as autonomous driving and medical treatment, running a new policy in the actual environment can be expensive, risky, and/or unethical. Creating a simulated environment for policy evaluation is common practice, but building a high-fidelity simulator is often harder than finding an optimal policy itself (consider building a simulated patient that covers all possible medical conditions). RL practitioners therefore find themselves stuck in a painful dilemma: to deploy a new policy, they must show it is of sufficient quality, but the only reliable way to do so appears to be deploying the policy!
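To make this quality criterion concrete, a minimal sketch of one standard formalization (assuming an episodic, discounted setting with discount factor $\gamma \in [0,1]$, which is not spelled out above) writes the value of a policy $\pi$ as
\[
v(\pi) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{T} \gamma^{t} r_{t} \;\middle|\; a_t \sim \pi(\cdot \mid s_t) \,\right],
\]
the expected cumulative (discounted) reward obtained when actions are drawn from $\pi$; the average-reward criterion referred to above is a closely related variant of this quantity.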