Understanding Curriculum Learning in Policy Optimization for Solving Combinatorial Optimization Problems

In recent years, reinforcement learning (RL) has begun to show promising results in tackling combinatorial optimization (CO) problems, particularly when coupled with curriculum learning to facilitate training. Despite this emerging empirical evidence, theoretical understanding of why RL helps is still in its early stages. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov decision processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Secretary Problem, we formally prove that distribution shift is reduced exponentially with curriculum learning, even when the curriculum is randomly generated. Our theory also shows that the curriculum learning scheme used in prior work can be simplified from multi-step to single-step. Finally, we provide extensive experiments on the Secretary Problem and Online Knapsack to verify our findings.
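
To make the setting concrete, below is a minimal, illustrative sketch (not the paper's implementation) of the Secretary Problem cast as an episodic RL task: a tabular policy decides, at each time step, whether to accept the current candidate when it is the best seen so far, and is updated with a natural-policy-gradient-style step in which the per-state Fisher information of a Bernoulli softmax policy, p(1-p), preconditions the REINFORCE score. The horizon N, step size ETA, episode count, and the per-step logit parameterization are all assumptions for illustration; the paper's actual environments, curricula, and hyperparameters may differ.

```python
# Hedged sketch: Secretary Problem as an episodic MDP with a tabular policy
# trained by an NPG-style update. Reward is 1 iff the accepted candidate is
# the overall best. All constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 20          # candidates per episode (assumed horizon)
ETA = 0.1       # step size (assumed)
EPISODES = 20000

theta = np.zeros(N)  # per-step logit for "accept" when the item is best-so-far

def accept_prob(t):
    """Sigmoid probability of accepting a best-so-far candidate at step t."""
    return 1.0 / (1.0 + np.exp(-theta[t]))

def run_episode():
    """Sample one random-order instance; return visited steps, actions, reward."""
    values = rng.permutation(N)              # distinct ranks in random order
    best_so_far = -1
    steps, actions = [], []
    for t in range(N):
        if values[t] > best_so_far:          # only best-so-far items matter
            best_so_far = values[t]
            a = rng.random() < accept_prob(t)
            steps.append(t)
            actions.append(a)
            if a:
                return steps, actions, float(values[t] == N - 1)
    return steps, actions, 0.0               # never accepted: reward 0

for _ in range(EPISODES):
    steps, actions, r = run_episode()
    for t, a in zip(steps, actions):
        p = accept_prob(t)
        # REINFORCE score d log pi / d theta[t], preconditioned by the
        # per-state Fisher information p(1-p) (floored for stability).
        score = (1.0 - p) if a else -p
        fisher = max(p * (1.0 - p), 0.05)
        theta[t] += ETA * r * score / fisher

# The learned policy should roughly recover the classic threshold rule:
# reject about the first N/e candidates, then accept the first best-so-far.
print(np.round([accept_prob(t) for t in range(N)], 2))
```

Under these assumed settings, the learned accept probabilities should stay low for roughly the first N/e steps and rise afterwards, mirroring the classic optimal threshold rule. This sketch trains directly on the full horizon; the paper's point is that a curriculum (e.g., warm-starting from shorter or easier instances) yields a stronger sampling policy and shrinks the distribution-shift term that governs the NPG convergence rate.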
