Reinforcement Learning with Heterogeneous Policy Representations

In Reinforcement Learning (RL), the goal is to find a policy π that maximizes the expected future return, computed from a scalar reward function R(·) ∈ ℝ. The policy π determines which actions the RL agent performs. Traditionally, the RL problem is formulated as a Markov Decision Process (MDP) or a Partially Observable MDP (POMDP). In this formulation, the policy π is viewed as a mapping (π : s ↦ a) from a state s ∈ S to an action a ∈ A. This approach, however, suffers severely from the curse of dimensionality.
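
For reference, this objective can be sketched as the standard expected discounted return (a conventional formulation; the discount factor γ, horizon T, and per-step states and actions s_t, a_t are notational assumptions not introduced in the paragraph above):

$$
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right],
\qquad
\pi^{*} \;=\; \arg\max_{\pi}\, J(\pi)
$$

Under this reading, the curse of dimensionality arises because a tabular mapping π : s ↦ a must cover a state space S whose size grows exponentially with the number of state variables.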
