Learning 2-opt Local Search from Heuristics as Expert Demonstrations

Deep Reinforcement Learning (RL) has achieved considerable success in solving routing problems. However, state-of-the-art deep RL approaches require large amounts of data before they reach reasonable performance. This may be acceptable for small problems, but as instances grow it severely limits the applicability of these methods to many real-world instances. In this work, we study a setting where the agent has access to data from previously handcrafted heuristics for the Traveling Salesman Problem, namely demonstrations from 2-opt improvement policies. Our goal is to learn policies that surpass the quality of the demonstrations while requiring fewer samples than pure RL. We propose to first learn policies with Imitation Learning (IL), leveraging a small set of demonstration data to accelerate policy learning. Afterward, we combine on-policy updates with value function approximation to improve upon the expert's performance. We show that our method learns good policies in less time and with less data than classical policy gradient, which does not incorporate demonstration data into RL. Moreover, in terms of solution quality, it performs on par with other state-of-the-art deep RL approaches.
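As a concrete illustration of the expert whose demonstrations the agent imitates, the minimal sketch below implements a single first-improvement 2-opt step for a Euclidean TSP tour. The function names and the first-improvement pivoting rule are our illustrative assumptions, not details taken from the paper.

    import math

    def tour_length(tour, coords):
        # Total length of the closed tour over 2-D city coordinates.
        return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
                   for i in range(len(tour)))

    def two_opt_first_improvement(tour, coords):
        # One first-improvement 2-opt step: remove edges (tour[i], tour[i+1])
        # and (tour[j], tour[j+1]), then reconnect by reversing tour[i+1 .. j].
        # Returns (new_tour, True) on an improving move, or (tour, False) if
        # the tour is already 2-opt optimal.
        n = len(tour)
        for i in range(n - 1):
            # When i == 0, stop at j == n - 2 so the two removed edges are distinct.
            for j in range(i + 2, n - (i == 0)):
                a, b = coords[tour[i]], coords[tour[i + 1]]
                c, d = coords[tour[j]], coords[tour[(j + 1) % n]]
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-10:  # strictly improving move found
                    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:], True
        return tour, False

Applying such moves until none improves yields a 2-opt local optimum; recording the (tour, move) pairs along the way produces the kind of demonstration data described above, which the IL stage can imitate before the RL updates take over.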
