Reinforcement learning in multidimensional continuous action spaces

Most reinforcement learning algorithms available today focus on approximating the state (V) or state-action (Q) value function, and efficient action selection comes as an afterthought. Real-world problems, on the other hand, tend to have large action spaces, where evaluating every possible action is impractical. This mismatch presents a major obstacle to successfully applying reinforcement learning to real-world problems. In this paper we present an effective approach to learning and acting in domains with multidimensional and/or continuous control variables, in which efficient action selection is embedded in the learning process. Instead of learning and representing the state or state-action value function of the MDP, we learn a value function over an implied augmented MDP, where states represent collections of actions in the original MDP and transitions represent choices that eliminate parts of the action space at each step. Action selection in the original MDP then reduces to a binary search by the agent in the transformed MDP, with computational complexity logarithmic in the number of actions, or equivalently linear in the number of action dimensions. Our method can be combined with any discrete-action reinforcement learning algorithm to learn multidimensional continuous-action policies, using a state value approximator in the transformed MDP. Preliminary results with two well-known reinforcement learning algorithms (Least-Squares Policy Iteration and Fitted Q-Iteration) on two continuous-action domains (1-dimensional inverted pendulum regulator, 2-dimensional bicycle balancing) demonstrate the viability and potential of the proposed approach.
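
As a concrete illustration of the selection scheme described above, here is a minimal Python sketch of binary action search over a continuous action box. It is not the paper's implementation: v_aug stands in for a learned value function over the augmented MDP (it scores a state of the original MDP paired with the box of actions still under consideration), and select_action, low, high, and depth are hypothetical names chosen for this example.

```python
import numpy as np

def select_action(state, v_aug, low, high, depth=10):
    """Binary action search sketch (illustrative, not from the paper).

    v_aug(state, lo, hi) is an assumed learned value function over the
    augmented MDP, scoring the augmented state whose remaining feasible
    actions form the box [lo, hi].
    """
    lo = np.asarray(low, dtype=float).copy()
    hi = np.asarray(high, dtype=float).copy()
    for d in range(lo.size):        # linear in the number of action dimensions
        for _ in range(depth):      # logarithmic refinement within dimension d
            mid = 0.5 * (lo[d] + hi[d])
            left_hi, right_lo = hi.copy(), lo.copy()
            left_hi[d], right_lo[d] = mid, mid
            # Each transition in the augmented MDP discards half of the box.
            if v_aug(state, lo, left_hi) >= v_aug(state, right_lo, hi):
                hi[d] = mid
            else:
                lo[d] = mid
    return 0.5 * (lo + hi)          # commit to the center of the final box

# Toy usage with a stand-in value function that prefers actions near 0.3:
v_aug = lambda s, lo, hi: -np.sum((0.5 * (lo + hi) - 0.3) ** 2)
print(select_action(state=None, v_aug=v_aug, low=[-1, -1], high=[1, 1], depth=12))
```

With a value function trained on the augmented MDP (for instance by LSPI or Fitted Q-Iteration over augmented states), each action selection costs 2 * depth * dims value evaluations instead of one evaluation per discretized action.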
