Reinforcement learning for factored Markov decision processes
[1] R. Bellman. A Markovian Decision Process , 1957 .
[2] R. A. Howard. Dynamic Programming and Markov Processes , 1960 .
[3] L. Baum,et al. Statistical Inference for Probabilistic Functions of Finite State Markov Chains , 1966 .
[4] L. Baum,et al. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .
[5] Edward J. Sondik,et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon , 1973, Oper. Res..
[6] D. Rubin,et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion) , 1977 .
[7] Edward J. Sondik,et al. The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs , 1978, Oper. Res..
[8] R. Shumway,et al. AN APPROACH TO TIME SERIES SMOOTHING AND FORECASTING USING THE EM ALGORITHM , 1982 .
[9] Richard S. Sutton,et al. Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.
[10] Donald Geman,et al. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[11] Geoffrey E. Hinton,et al. A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..
[12] D. Rumelhart. Learning internal representations by back-propagating errors , 1986 .
[13] Paul Smolensky,et al. Information processing in dynamical systems: foundations of harmony theory , 1986 .
[14] Geoffrey E. Hinton,et al. Learning and relearning in Boltzmann machines , 1986 .
[15] Ross D. Shachter. Evaluating Influence Diagrams , 1986, Oper. Res..
[16] Gregory F. Cooper,et al. A Method for Using Belief Networks as Influence Diagrams , 1988, UAI.
[18] C. Watkins. Learning from delayed rewards , 1989 .
[19] Judea Pearl,et al. Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.
[20] Keiji Kanazawa,et al. A model for reasoning about persistence and causation , 1989 .
[21] C. Robert Kenley,et al. Gaussian influence diagrams , 1989 .
[22] Richard S. Sutton,et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming , 1990, ML.
[23] Gregory F. Cooper,et al. The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks , 1990, Artif. Intell..
[24] David Haussler,et al. Unsupervised learning of distributions on binary vectors using two layer networks , 1991, NIPS.
[25] W. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes , 1991 .
[26] P. Dayan. The Convergence of TD(λ) for General λ , 1992, Machine Learning.
[27] Andreas Stolcke,et al. Hidden Markov Model Induction by Bayesian Model Merging , 1992, NIPS.
[28] Ross D. Shachter,et al. Decision Making Using Probabilistic Inference Methods , 1992, UAI.
[29] Radford M. Neal. Connectionist Learning of Belief Networks , 1992, Artif. Intell..
[30] Lonnie Chrisman,et al. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach , 1992, AAAI.
[31] Holly A. Yanco,et al. An adaptive communication protocol for cooperating mobile robots , 1993 .
[32] Michael Luby,et al. Approximating Probabilistic Inference in Bayesian Belief Networks is NP-Hard , 1993, Artif. Intell..
[33] Leemon C Baird,et al. Reinforcement Learning With High-Dimensional, Continuous Actions , 1993 .
[34] Mahesan Niranjan,et al. On-line Q-learning using connectionist systems , 1994 .
[35] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .
[36] Solomon Eyal Shimony,et al. Finding MAPs for Belief Networks is NP-Hard , 1994, Artif. Intell..
[37] Michael I. Jordan,et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems , 1994, NIPS.
[38] Jing Peng,et al. Incremental multi-step Q-learning , 1994, Machine Learning.
[39] Kenji Doya,et al. Temporal Difference Learning in Continuous Time and Space , 1995, NIPS.
[40] Stuart J. Russell,et al. Approximating Optimal Policies for Partially Observable Stochastic Domains , 1995, IJCAI.
[41] Michael I. Jordan,et al. Reinforcement Learning by Probability Matching , 1995, NIPS.
[42] Leslie Pack Kaelbling,et al. Learning Policies for Partially Observable Environments: Scaling Up , 1997, ICML.
[43] Richard S. Sutton,et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1996 .
[44] Radford M. Neal. Bayesian Learning for Neural Networks , 1995 .
[45] Andrew G. Barto,et al. Learning to Act Using Real-Time Dynamic Programming , 1995, Artif. Intell..
[46] M. Littman,et al. Efficient dynamic-programming updates in partially observable Markov decision processes , 1995 .
[47] Michael I. Jordan,et al. Mean Field Theory for Sigmoid Belief Networks , 1996, J. Artif. Intell. Res..
[48] David J. C. MacKay,et al. Bayesian Non-linear Modeling for the Prediction Competition , 1996 .
[49] Wenju Liu,et al. Planning in Stochastic Domains: Problem Characteristics and Approximation , 1996 .
[50] Andrew McCallum,et al. Reinforcement learning with selective perception and hidden state , 1996 .
[51] Dimitri P. Bertsekas,et al. Neuro-Dynamic Programming , 1996, Athena Scientific.
[52] Craig Boutilier,et al. Computing Optimal Policies for Partially Observable Decision Processes Using Compact Representations , 1996, AAAI/IAAI, Vol. 2.
[53] Craig Boutilier,et al. Approximate Value Trees in Structured Dynamic Programming , 1996, ICML.
[54] Prasad Tadepalli,et al. Scaling Up Average Reward Reinforcement Learning by Approximating the Domain Models and the Value Function , 1996, ICML.
[55] Michael I. Jordan,et al. Variational methods for inference and estimation in graphical models , 1997 .
[56] Stuart J. Russell,et al. Reinforcement Learning with Hierarchies of Machines , 1997, NIPS.
[57] Ashwin Ram,et al. Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces , 1997, Adapt. Behav..
[58] Geoffrey E. Hinton,et al. Generative models for discovering sparse distributed representations. , 1997, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.
[59] Craig Boutilier,et al. Abstraction and Approximate Decision-Theoretic Planning , 1997, Artif. Intell..
[60] Michael L. Littman,et al. Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes , 1997, UAI.
[61] A. McCallum. Efficient Exploration in Reinforcement Learning with Hidden State , 1997 .
[62] Doina Precup,et al. Theoretical Results on Reinforcement Learning with Temporally Abstract Options , 1998, ECML.
[63] Brian Sallans,et al. A Hierarchical Community of Experts , 1999, Learning in Graphical Models.
[64] Stuart J. Russell. Learning agents for uncertain environments (extended abstract) , 1998, COLT '98.
[65] Nevin Lianwen Zhang,et al. Probabilistic Inference in Influence Diagrams , 1998, Comput. Intell..
[66] Kee-Eung Kim,et al. Solving Stochastic Planning Problems with Large State and Action Spaces , 1998, AIPS.
[67] Kee-Eung Kim,et al. Solving Very Large Weakly Coupled Markov Decision Processes , 1998, AAAI/IAAI.
[68] Radford M. Neal. Assessing Relevance determination methods using DELVE , 1998 .
[69] Xavier Boyen,et al. Tractable Inference for Complex Stochastic Processes , 1998, UAI.
[70] Eric A. Hansen,et al. Solving POMDPs by Searching in Policy Space , 1998, UAI.
[71] Andrew W. Moore,et al. Gradient Descent for General Reinforcement Learning , 1998, NIPS.
[72] Amy McGovern,et al. AcQuire-macros: An Algorithm for Automatically Learning Macro-actions , 1998 .
[73] A. Cassandra,et al. Exact and approximate algorithms for partially observable markov decision processes , 1998 .
[74] Michael I. Jordan,et al. An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.
[75] Shin Ishii,et al. Reinforcement Learning Based on On-Line EM Algorithm , 1998, NIPS.
[76] Christopher M. Bishop,et al. Neural networks and machine learning , 1998 .
[77] Mark A. Shayman,et al. Solving POMDP by On-policy Linear Approximate Learning Algorithm , 1999 .
[78] Brian Sallans,et al. Learning Factored Representations for Partially Observable Markov Decision Processes , 1999, NIPS.
[79] Daphne Koller,et al. Computing Factored Value Functions for Policies in Structured MDPs , 1999, IJCAI.
[80] Leslie Pack Kaelbling,et al. Learning Policies with External Memory , 1999, ICML.
[81] David A. McAllester,et al. Approximate Planning for Factored POMDPs using Belief State Simplification , 1999, UAI.
[82] Geoffrey E. Hinton. Products of experts , 1999 .
[83] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[84] Daphne Koller,et al. Reinforcement Learning Using Approximate Belief States , 1999, NIPS.
[85] Vijay R. Konda,et al. Actor-Critic Algorithms , 1999, NIPS.
[86] Andrew W. Moore,et al. Distributed Value Functions , 1999, ICML.
[87] Sebastian Thrun,et al. Monte Carlo POMDPs , 1999, NIPS.
[88] Andrew Y. Ng,et al. Policy Search via Density Estimation , 1999, NIPS.
[89] Craig Boutilier,et al. Value-Directed Belief State Approximation for POMDPs , 2000, UAI.
[90] Kee-Eung Kim,et al. Learning to Cooperate via Policy Search , 2000, UAI.
[91] Daphne Koller,et al. Policy Iteration for Factored MDPs , 2000, UAI.
[92] Geoffrey E. Hinton,et al. Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task , 2000, NIPS.
[93] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition , 1999, J. Artif. Intell. Res..
[94] Geoffrey J. Gordon. Reinforcement Learning with Function Approximation Converges to a Region , 2000, NIPS.
[95] Yee Whye Teh,et al. Rate-coded Restricted Boltzmann Machines for Face Recognition , 2000, NIPS.
[96] Craig Boutilier,et al. Stochastic dynamic programming with factored representations , 2000, Artif. Intell..
[97] Jesse Hoey,et al. APRICODD: Approximate Policy Construction Using Decision Diagrams , 2000, NIPS.
[98] Michael I. Jordan,et al. PEGASUS: A policy search method for large MDPs and POMDPs , 2000, UAI.
[99] Prakash P. Shenoy,et al. A Forward Monte Carlo Method For Solving Influence Diagrams Using Local Computation , 2000 .
[100] Katia P. Sycara,et al. Evolutionary Search, Stochastic Policies with Memory, and Reinforcement Learning with Hidden State , 2001, ICML.
[101] C. Lee Giles,et al. How communication can improve the performance of multi-agent systems , 2001, AGENTS '01.
[102] Craig Boutilier,et al. Value-directed sampling methods for monitoring POMDPs , 2001, UAI.
[103] Craig Boutilier,et al. Vector-space Analysis of Belief-state Approximation for POMDPs , 2001, UAI.
[104] Yee Whye Teh,et al. Discovering Multiple Constraints that are Frequently Approximately Satisfied , 2001, UAI.
[105] Zoubin Ghahramani,et al. Variational Learning for Switching State-Space Models , 2000, Neural Computation.
[106] Geoffrey E. Hinton,et al. Products of Hidden Markov Models , 2001, AISTATS.
[107] Carlos Guestrin,et al. Multiagent Planning with Factored MDPs , 2001, NIPS.
[108] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems , 1960 .
[109] Simon J. Godsill,et al. Marginal maximum a posteriori estimation using Markov chain Monte Carlo , 2002, Stat. Comput..
[110] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.
[111] Peter Dayan,et al. Q-learning , 1992, Machine Learning.
[112] Peter Dayan,et al. Analytical Mean Squared Error Curves for Temporal Difference Learning , 1996, Machine Learning.
[113] Richard S. Sutton,et al. Reinforcement learning with replacing eligibility traces , 1996, Machine Learning.
[114] Michael I. Jordan,et al. Factorial Hidden Markov Models , 1995, Machine Learning.
[115] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 1992, Machine Learning.
[116] Sean R Eddy,et al. What is dynamic programming? , 2004, Nature Biotechnology.
[117] R. Dearden. Structured Prioritized Sweeping , 2001, ICML.