POMO: Policy Optimization with Multiple Optima for Reinforcement Learning

In neural combinatorial optimization (CO), reinforcement learning (RL) can turn a deep neural net into a fast, powerful heuristic solver of NP-hard problems. This approach has great potential in practical applications because it allows near-optimal solutions to be found without expert guidance built on substantial domain knowledge. We introduce Policy Optimization with Multiple Optima (POMO), an end-to-end approach for building such a heuristic solver. POMO is applicable to a wide range of CO problems and is designed to exploit the symmetries in the representation of a CO solution. POMO uses a modified REINFORCE algorithm that forces diverse rollouts towards all optimal solutions. Empirically, the low-variance baseline of POMO makes RL training fast and stable, and it is more resistant to local minima than previous approaches. We also introduce a new augmentation-based inference method, which pairs naturally with POMO. We demonstrate the effectiveness of POMO by solving three popular NP-hard problems, namely, traveling salesman (TSP), capacitated vehicle routing (CVRP), and 0-1 knapsack (KP). For all three, our POMO-based solver shows a significant improvement in performance over all recent learned heuristics. In particular, we achieve an optimality gap of 0.14% on TSP100 while reducing inference time by more than an order of magnitude.
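As a minimal sketch of the two ideas named above, the shared-baseline policy-gradient loss over multiple rollouts and the ×8 coordinate augmentation for inference, the following PyTorch snippet illustrates how they could be implemented; the tensor names, shapes, and function signatures are assumptions for illustration, not the authors' released code.

```python
import torch

def pomo_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Shared-baseline REINFORCE loss over N rollouts of the same instance.

    log_probs: (batch, N) -- summed log action probabilities of each rollout
    rewards:   (batch, N) -- reward of each rollout (e.g. negative tour length)
    """
    # Baseline = mean reward over the N rollouts of the same instance,
    # giving a low-variance advantage without a learned critic.
    baseline = rewards.mean(dim=1, keepdim=True)        # (batch, 1)
    advantage = rewards - baseline                       # (batch, N)
    # Gradient ascent on expected reward == descent on this loss.
    return -(advantage.detach() * log_probs).mean()

def augment_xy_times_8(coords: torch.Tensor) -> torch.Tensor:
    """x8 instance augmentation: symmetric transforms of unit-square coordinates.

    coords: (batch, nodes, 2) -> (8 * batch, nodes, 2)
    """
    x, y = coords[..., 0:1], coords[..., 1:2]
    variants = [
        torch.cat(pair, dim=-1)
        for pair in [(x, y), (1 - x, y), (x, 1 - y), (1 - x, 1 - y),
                     (y, x), (1 - y, x), (y, 1 - x), (1 - y, 1 - x)]
    ]
    return torch.cat(variants, dim=0)
```

At inference time, the augmented copies of an instance would each be solved and the best of the resulting solutions kept, since every copy represents the same underlying problem.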
