Knowledge-based reward shaping with knowledge revision in reinforcement learning

Reinforcement learning has proven to be a successful artificial intelligence technique when an agent needs to act in, and improve its behaviour within, a given environment. Through constant interaction with the environment the agent receives feedback about its behaviour in the form of rewards, and over time it learns which actions are most beneficial in each situation. Reinforcement learning typically assumes that the agent has no prior knowledge about the environment in which it acts. Nevertheless, in many cases (potentially abstract and heuristic) domain knowledge about the learning task is available from domain experts and can be used to improve learning performance. One way of imparting such knowledge to an agent is reward shaping, which guides the agent by providing additional rewards. A common assumption when imparting knowledge to an agent is that the domain knowledge is always correct. Because the provided knowledge is heuristic in nature, this assumption is not always met, and it has been shown that when the provided knowledge is wrong the agent takes longer to learn the optimal policy. As reinforcement learning methods shift towards informed agents, the assumption that expert domain knowledge is always correct must be relaxed in order to scale these methods to more complex, real-life scenarios. To accomplish this, agents need a mechanism for dealing with cases where the provided expert knowledge is imperfect. This thesis investigates and documents the adverse effects that erroneous knowledge can have on an agent's learning process if care is not taken. Moreover, it presents a novel approach that handles erroneous knowledge through knowledge-revision principles, allowing agents to use their experience to revise the provided knowledge and thus benefit from more accurate shaping. Empirical evaluation shows that agents able to revise erroneous parts of the provided knowledge reach better policies faster than agents without knowledge-revision capabilities.
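To make the reward-shaping and revision ideas concrete, the sketch below shows tabular Q-learning with potential-based reward shaping on a toy chain world, where the heuristic potential plays the role of expert knowledge and is deliberately wrong for one state. All names, constants, and the revise_knowledge rule are illustrative assumptions for this sketch only; the thesis's actual revision mechanism is grounded in knowledge-revision principles rather than this simple TD-error adjustment.

# A minimal sketch (not the thesis's algorithm): tabular Q-learning on a toy chain
# world with potential-based reward shaping, where the shaping potential encodes
# heuristic "expert knowledge" that is deliberately wrong for one state, and a
# simple, hypothetical revision step adjusts that knowledge from experience.
import random
from collections import defaultdict

N_STATES = 10            # states 0..9; state 9 is the goal
ACTIONS = (-1, +1)       # move left or right
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1

# Heuristic potential over states (higher = believed closer to the goal).
# The value for state 5 is deliberately wrong, mimicking erroneous expert advice.
potential = {s: float(s) for s in range(N_STATES)}
potential[5] = -10.0

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def shaping(state, nxt):
    # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).
    return GAMMA * potential[nxt] - potential[state]

def revise_knowledge(state, td_error, rate=0.05):
    # Hypothetical, simplified revision rule: shift the potential of the visited
    # state in the direction of the TD error, so heuristic values that experience
    # repeatedly contradicts are gradually corrected.
    potential[state] += rate * td_error

Q = defaultdict(float)
for episode in range(300):
    state = 0
    for _ in range(200):  # cap episode length so the sketch always terminates
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        future = 0.0 if done else GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        td_error = reward + shaping(state, nxt) + future - Q[(state, action)]
        Q[(state, action)] += ALPHA * td_error
        revise_knowledge(state, td_error)
        state = nxt
        if done:
            break

Because a potential-based shaping signal only enters learning through the difference gamma * phi(s') - phi(s), an erroneous potential mainly affects how quickly the agent learns rather than which policy is ultimately optimal, which is why revising the wrong entries chiefly recovers learning speed, in line with the empirical findings summarised above.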
