Utilizing negative policy information to accelerate reinforcement learning

ACKNOWLEDGEMENTS

One consequence of my long tenure at Georgia Tech has been the opportunity to get to know a large number of exceptional people. I'm thankful for tremendous support from some particularly exceptional people: my advisor Charles Isbell, a gentleman and a scholar, who gave me the opportunity to think outside the box; and my adoptive advisor Andrea Thomaz, who helped me see that it's also quite a good idea to stop and focus on the box long enough to put a bow on it.
