Non-divergent Imitation for Verification of Complex Learned Controllers

We consider the problem of verifying complex learned controllers via distillation. In contrast to previous work, we require that the distilled model maintain behavioural fidelity with an oracle, and we define non-divergent path length (NPL) as a metric of this fidelity. We demonstrate that current distillation approaches with proven accuracy bounds do not achieve high expected NPL and can be outperformed by naive behavioural cloning. We therefore propose a distillation algorithm that typically gives greater expected NPL, improved sample efficiency, and more compact models. We prove properties of NPL maximization and demonstrate the performance of our algorithm on deep Q-network controllers for three standard learning environments used in this context: Pong, CartPole and MountainCar.
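
Since the abstract introduces NPL only informally, the following is a minimal Monte-Carlo sketch of one plausible way to estimate expected NPL, assuming a Gymnasium-style discrete-action environment. The names expected_npl, oracle_policy and distilled_policy are hypothetical illustrations, and this reading of the metric is an assumption, not the paper's formal definition.

```python
import numpy as np

def expected_npl(env, oracle_policy, distilled_policy, n_rollouts=100, max_steps=1000):
    """Estimate expected non-divergent path length (NPL) by Monte-Carlo rollouts.

    Assumed reading of the metric: the number of consecutive steps, starting
    from the initial state, on which the distilled policy selects the same
    action as the oracle while the distilled policy drives the environment.
    """
    lengths = []
    for _ in range(n_rollouts):
        obs, _ = env.reset()          # Gymnasium API: reset() -> (obs, info)
        steps = 0
        for _ in range(max_steps):
            a_distilled = distilled_policy(obs)
            a_oracle = oracle_policy(obs)
            if a_distilled != a_oracle:
                break                 # first divergence ends the non-divergent prefix
            obs, _, terminated, truncated, _ = env.step(a_distilled)
            steps += 1
            if terminated or truncated:
                break
        lengths.append(steps)
    return float(np.mean(lengths))
```

Under this reading, the rollout is driven by the distilled policy rather than the oracle, so the estimate reflects fidelity on the states the distilled controller actually visits.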
