Offline policy evaluation across representations with applications to educational games

Consider an autonomous teacher agent trying to adaptively sequence material to keep a student engaged, or a medical agent trying to suggest treatments that maximize patient outcomes. To solve these complex reinforcement learning problems, we must first decide on a policy representation. But determining the best representation can be challenging, since the environment includes many poorly understood processes (such as student engagement) and is therefore difficult to simulate accurately. These domains are also high-stakes, making it infeasible to evaluate candidate representations by running them online. Instead, one must leverage existing data to learn and evaluate new policies for future use. In this paper, we present a data-driven methodology for comparing and validating policies offline. Our method is unbiased, agnostic to representation, and focuses on each policy's ability to generalize to new data. We apply this methodology to a partially observable, high-dimensional concept-sequencing problem in an educational game. Guided by our evaluation methodology, we propose a novel feature compaction method that substantially improves policy performance on this problem. We deploy the best-performing policies to 2,000 real students and find that the learned adaptive policy achieves a statistically significant improvement over random and expert baselines, increasing our achievement-based reward measure by 32%.
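The abstract does not spell out the estimator, but an unbiased, representation-agnostic offline evaluation of this kind is typically built on importance sampling over logged trajectories, in the spirit of the off-policy evaluation literature this work draws on (e.g., Precup, Sutton, and Singh's eligibility-trace estimators and Li et al.'s unbiased contextual-bandit evaluation). The sketch below is a minimal illustration, not the paper's actual method: the data layout, function names, and the assumption that behavior-policy action probabilities were logged are all ours.

```python
import numpy as np

def importance_sampling_estimate(trajectories, target_policy, behavior_policy):
    """Unbiased off-policy estimate of a target policy's expected return.

    `trajectories` is a list of episodes, each a list of
    (state, action, reward) tuples collected under `behavior_policy`.
    Both policies map (state, action) -> probability of taking that
    action in that state. This layout is illustrative, not from the paper.
    """
    weighted_returns = []
    for episode in trajectories:
        weight = 1.0
        episode_return = 0.0
        for state, action, reward in episode:
            # Reweight by how much more (or less) likely the target policy
            # is to take the logged action than the behavior policy was.
            weight *= target_policy(state, action) / behavior_policy(state, action)
            episode_return += reward
        weighted_returns.append(weight * episode_return)
    # The mean of importance-weighted returns is unbiased for the target
    # policy's value, provided the behavior policy gives every action the
    # target policy might take nonzero probability (full support).
    return float(np.mean(weighted_returns))
```

A per-decision variant, which weights each reward only by the importance ratio accumulated up to its time step, reduces variance while remaining unbiased, which matters in high-variance educational domains like the one studied here.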
