Reinforcement Learning with Factored States and Actions

A novel method is presented for approximating the value function and selecting good actions in Markov decision processes with large state and action spaces. The method approximates state-action values as negative free energies in an undirected graphical model called a product of experts. The model parameters can be learned efficiently because values and their derivatives can be computed efficiently for a product of experts. Actions can be found even in large factored action spaces by means of Markov chain Monte Carlo sampling. Simulation results show that the product-of-experts approximation can be used to solve large problems; in one simulation it is used to find actions in an action space of size 2^40.
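
The abstract describes approximating Q(s, a) by the negative free energy of a product of experts and selecting actions by MCMC sampling. Below is a minimal sketch of that idea, assuming a restricted Boltzmann machine (a product of logistic experts) over binary state and action variables; the sizes, function names (free_energy, sample_action_gibbs), and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): Q(s, a) is approximated by the
# negative free energy -F(s, a) of a restricted Boltzmann machine whose
# visible layer is the concatenated binary state and action vectors.
# Actions are proposed by Gibbs (MCMC) sampling over the action units
# with the state units clamped. All names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)

N_STATE, N_ACTION, N_HIDDEN = 10, 8, 16   # sizes chosen arbitrarily for the sketch

# RBM parameters: weights from (state, action) visible units to hidden units,
# plus visible and hidden biases.
W = 0.01 * rng.standard_normal((N_STATE + N_ACTION, N_HIDDEN))
b_vis = np.zeros(N_STATE + N_ACTION)
b_hid = np.zeros(N_HIDDEN)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(state, action):
    """F(s, a) of the RBM; Q(s, a) is approximated by -F(s, a)."""
    v = np.concatenate([state, action])
    hidden_in = b_hid + v @ W
    # Hidden units are summed out analytically: one softplus term per expert.
    return -(v @ b_vis) - np.sum(np.logaddexp(0.0, hidden_in))

def q_value(state, action):
    return -free_energy(state, action)

def sample_action_gibbs(state, n_sweeps=20):
    """Propose an action by alternating Gibbs sampling with the state clamped."""
    action = rng.integers(0, 2, size=N_ACTION).astype(float)
    for _ in range(n_sweeps):
        v = np.concatenate([state, action])
        # Sample hidden units given the (clamped) state and current action.
        h = (rng.random(N_HIDDEN) < sigmoid(b_hid + v @ W)).astype(float)
        # Resample only the action part of the visible layer given the hiddens.
        act_in = b_vis[N_STATE:] + h @ W[N_STATE:].T
        action = (rng.random(N_ACTION) < sigmoid(act_in)).astype(float)
    return action

# Example: score a sampled action for a random binary state.
s = rng.integers(0, 2, size=N_STATE).astype(float)
a = sample_action_gibbs(s)
print("approximate Q(s, a) =", q_value(s, a))
```

With the hidden units summed out analytically, -F(s, a) and its parameter derivatives can be evaluated in time linear in the number of experts, and Gibbs sampling over the action units avoids enumerating all 2^N_ACTION joint actions.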
