Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems

In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed and to interpreting the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interaction with the environment is not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms, and we present three application scenarios: (1) the high-level decision-making problem in highway driving, (2) inferring user preferences in a social network (Twitter), and (3) the management of water releases in Lake Como. For each scenario, we provide a formalization, experiments, and a discussion to interpret the obtained results.