Learning Action Representations for Reinforcement Learning

Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We present an algorithm that both learns and uses action representations, along with conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems.
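The decomposition described above can be illustrated with a minimal sketch: an internal policy maps a state to a point in a low-dimensional representation space, and a second component maps that point to a concrete discrete action. All names, sizes, and the nearest-neighbor mapping below are illustrative assumptions (in the full method, the embeddings and both components would be learned from data), not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, embed_dim, state_dim = 1000, 4, 8  # illustrative sizes

# Placeholder action representations: one embedding per discrete action.
# In the learned setting these would come from observed transitions.
action_embeddings = rng.normal(size=(n_actions, embed_dim))

# Internal policy component: maps a state to a point in the
# low-dimensional representation space (a linear map, for illustration).
W = 0.1 * rng.normal(size=(embed_dim, state_dim))

def internal_policy(state):
    # Produce a continuous action representation e for this state.
    return W @ state

def to_action(e):
    # Transform the representation into an actual action, here by
    # nearest-neighbor lookup in the embedding space.
    dists = np.linalg.norm(action_embeddings - e, axis=1)
    return int(np.argmin(dists))

state = rng.normal(size=state_dim)
action = to_action(internal_policy(state))
```

Because the internal policy operates in a space of dimension `embed_dim` rather than over `n_actions` discrete choices, actions with nearby embeddings are treated similarly, which is the source of the generalization the abstract describes.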
