Policy Supervectors: General Characterization of Agents by their Behaviour

By studying the underlying policies of decision-making agents, we can learn about their shortcomings and potentially improve them. Traditionally, this has been done by examining the agent's implementation, by observing its behaviour during execution, by measuring its performance with a reward/fitness function, or by visualizing the density of states it visits. However, these methods either fail to describe the policy's behaviour in complex, high-dimensional environments or do not scale to thousands of policies, which is required when studying training algorithms. We propose policy supervectors for characterizing agents by the distribution of states they visit, adopting successful techniques from speech technology. Policy supervectors can characterize policies regardless of their design philosophy (e.g. rule-based vs. neural network) and scale to thousands of policies on a single workstation. We demonstrate the method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training and imitation learning, providing insight into, for example, how the search space of evolutionary algorithms is reflected not only in the agents' parameters but also in their behaviour.
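As a rough illustration of how such a characterization can be built, the sketch below follows the GMM-supervector recipe from speaker verification: fit a shared background GMM on states pooled from many policies, MAP-adapt only its means to one policy's visited states, and stack the adapted means into a fixed-length vector. This is a minimal sketch under assumed choices (function names, diagonal covariances, component count, relevance factor are illustrative), not the paper's exact procedure.

```python
# Minimal sketch of a GMM-supervector pipeline for policies, assuming a
# shared background GMM and mean-only MAP adaptation (relevance factor r).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_background_model(pooled_states, n_components=64, seed=0):
    """Fit a diagonal-covariance GMM on states pooled from all policies."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed)
    ubm.fit(pooled_states)
    return ubm

def policy_supervector(ubm, states, relevance=16.0):
    """MAP-adapt the background means to one policy's visited states and
    concatenate them into a single fixed-length supervector."""
    resp = ubm.predict_proba(states)            # (n_states, n_components)
    n_k = resp.sum(axis=0)                      # soft counts per component
    # Responsibility-weighted mean of the states for each component.
    e_k = (resp.T @ states) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]  # adaptation coefficients
    adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                # length: n_components * state_dim

# Usage sketch: distances between supervectors (e.g. Euclidean) can then be
# used to compare or cluster thousands of policies.
# sv_a = policy_supervector(ubm, states_of_policy_a)
# sv_b = policy_supervector(ubm, states_of_policy_b)
# dist = np.linalg.norm(sv_a - sv_b)
```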
