Q-learning for history-based reinforcement learning

We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e. to POMDPs. We do this in a natural manner by adding ℓ0 regularisation to the pathwise squared Q-learning objective function and then optimising this over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class nested with classical Q-learning of the parameters. This algorithm fits perfectly into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criterion used so far for feature reinforcement learning has been model-based and aimed at predicting future states and rewards. Instead we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting where features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly for learning the Q-function of the optimal policy, and we use ℓ0 instead of ℓ1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvement in convergence speed as well as in economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward, using less than half the CPU time and 36 times less memory. Overall, our algorithm hQL provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
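To make the nested optimisation concrete, the following is a minimal Python sketch, not the authors' implementation: it assumes a simple suffix-length map class (each candidate map sends a history to its last k observations), a list-of-episodes data format, and illustrative hyperparameters. The inner loop is classical tabular Q-learning on the states induced by the map; the outer loop is a stochastic search over maps scored by a pathwise squared Q-error plus an ℓ0-style penalty on the number of distinct states the map uses.

```python
# Illustrative sketch only (not the paper's implementation): stochastic search
# over candidate history-to-state maps, with tabular Q-learning run inside each
# candidate and a cost given by the pathwise squared Q-error plus an l0-style
# penalty on the number of induced states. Map class, episode format and
# hyperparameters are assumptions made for this example.

import random
from collections import defaultdict

GAMMA, ALPHA, LAMBDA = 0.99, 0.1, 0.5   # discount, learning rate, l0 weight (assumed)

def make_phi(k):
    """Map a history (tuple of observations) to the tuple of its last k items."""
    return lambda history: history[-k:] if k > 0 else ()

def q_learning(episodes, phi, n_actions):
    """Classical tabular Q-learning on the states induced by phi."""
    Q = defaultdict(float)
    for episode in episodes:                      # episode: [(obs, action, reward), ...]
        history = ()
        for t, (obs, action, reward) in enumerate(episode):
            history += (obs,)
            s = phi(history)
            if t + 1 < len(episode):              # bootstrap from the next induced state
                s_next = phi(history + (episode[t + 1][0],))
                target = reward + GAMMA * max(Q[(s_next, a)] for a in range(n_actions))
            else:
                target = reward
            Q[(s, action)] += ALPHA * (target - Q[(s, action)])
    return Q

def cost(episodes, phi, Q, n_actions):
    """Pathwise squared Q-error plus an l0-style penalty on the state count."""
    sq_err, states = 0.0, set()
    for episode in episodes:
        history = ()
        for t, (obs, action, reward) in enumerate(episode):
            history += (obs,)
            s = phi(history)
            states.add(s)
            if t + 1 < len(episode):
                s_next = phi(history + (episode[t + 1][0],))
                target = reward + GAMMA * max(Q[(s_next, a)] for a in range(n_actions))
            else:
                target = reward
            sq_err += (target - Q[(s, action)]) ** 2
    return sq_err + LAMBDA * len(states)          # l0 term: number of distinct states used

def stochastic_map_search(episodes, n_actions, n_steps=50, max_k=8):
    """Outer loop: stochastic local search over the suffix-length map class."""
    best_k, best_cost, best_Q = 1, float("inf"), None
    k = 1
    for _ in range(n_steps):
        proposal = max(0, min(max_k, k + random.choice([-1, 1])))
        phi = make_phi(proposal)
        Q = q_learning(episodes, phi, n_actions)  # inner loop: classical Q-learning
        c = cost(episodes, phi, Q, n_actions)
        if c < best_cost:                         # greedy acceptance of the proposal
            best_k, best_cost, best_Q = proposal, c, Q
            k = proposal
    return best_k, best_Q
```

The greedy acceptance rule above stands in for whatever stochastic search and richer map class are actually used; only the overall structure (an outer search over maps, an inner run of Q-learning, and a cost combining the squared Q-error with an ℓ0 penalty) mirrors the description in the abstract.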
