Evolving Robust Policy Coverage Sets in Multi-Objective Markov Decision Processes Through Intrinsically Motivated Self-Play

Many real-world decision-making problems involve multiple conflicting objectives that cannot be optimized simultaneously without compromise. Such problems, known as multi-objective Markov decision processes, pose a significant challenge for conventional single-objective reinforcement learning methods, especially when an optimal compromise cannot be determined beforehand. Multi-objective reinforcement learning methods address this challenge by finding an optimal coverage set of non-dominated policies that can satisfy any user preference for solving the problem. However, this comes at the cost of increased computational complexity and training time, as well as a lack of adaptability to non-stationary environment dynamics. Addressing these limitations requires adaptive methods that can solve the problem online and robustly. In this paper, we propose a novel developmental method that exploits adversarial self-play between an intrinsically motivated preference exploration component and a policy coverage set optimization component, where the latter robustly evolves a convex coverage set of policies using the preferences proposed by the former. We experimentally demonstrate the effectiveness of the proposed method in comparison to state-of-the-art multi-objective reinforcement learning methods in stationary and non-stationary environments.
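The interaction the abstract describes, a preference explorer playing against a coverage-set optimizer, can be made concrete with a short sketch. The Python code below is a minimal illustration under stated assumptions, not the paper's implementation: the names PreferenceExplorer, CoverageSetOptimizer, and train_and_evaluate are hypothetical, linear scalarization is assumed for the convex coverage set, and learning progress is used as a simple proxy for the intrinsic-motivation signal.

```python
import numpy as np


def scalarize(value_vec, w):
    """Linear scalarization of a multi-objective value vector."""
    return float(np.dot(w, value_vec))


class CoverageSetOptimizer:
    """Maintains an approximate convex coverage set (CCS): the set of
    policy value vectors that maximize the scalarized return for at
    least one linear preference w."""

    def __init__(self):
        self.value_vectors = []

    def best_scalarized(self, w):
        if not self.value_vectors:
            return float("-inf")
        return max(scalarize(v, w) for v in self.value_vectors)

    def update(self, candidate, probe_weights):
        """Keep the candidate only if it improves on the current set
        for at least one probed preference."""
        if any(scalarize(candidate, w) > self.best_scalarized(w)
               for w in probe_weights):
            self.value_vectors.append(candidate)
            return True
        return False


class PreferenceExplorer:
    """Proposes preference weights on the simplex, biased toward regions
    where the optimizer recently made the most learning progress (a crude
    stand-in for an intrinsic-motivation signal)."""

    def __init__(self, n_objectives, seed=0):
        self.n = n_objectives
        self.rng = np.random.default_rng(seed)
        self.history = []  # (weights, learning_progress) pairs

    def propose(self):
        if self.history and self.rng.random() < 0.5:
            # Perturb the weights that yielded the highest progress.
            w, _ = max(self.history, key=lambda h: h[1])
            w = np.clip(w + self.rng.normal(0.0, 0.1, self.n), 1e-6, None)
        else:
            w = self.rng.random(self.n) + 1e-6
        return w / w.sum()  # project back onto the preference simplex

    def feedback(self, w, progress):
        self.history.append((w, progress))


def self_play(train_and_evaluate, n_objectives, iterations=100):
    """train_and_evaluate(w) is assumed to train a policy on the
    w-scalarized reward and return its multi-objective value vector."""
    explorer = PreferenceExplorer(n_objectives)
    optimizer = CoverageSetOptimizer()
    # Always probe the extreme (single-objective) preferences as well.
    probes = [np.eye(n_objectives)[i] for i in range(n_objectives)]
    for _ in range(iterations):
        w = explorer.propose()
        probes.append(w)
        value_vec = np.asarray(train_and_evaluate(w))
        before = optimizer.best_scalarized(w)
        optimizer.update(value_vec, probes)
        after = optimizer.best_scalarized(w)
        progress = after - (before if np.isfinite(before) else 0.0)
        explorer.feedback(w, progress)
    return optimizer.value_vectors
```

In this sketch a value vector enters the set only if it strictly improves the scalarized return for at least one probed preference, which is what makes the retained set an approximation of the convex coverage set rather than of the full Pareto front; the explorer's reuse of high-progress weights is the self-play pressure that keeps proposing preferences the optimizer has not yet covered well.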
