Towards Variance Reduction for Reinforcement Learning of Industrial Decision-making Tasks: A Bi-Critic based Demand-Constraint Decoupling Approach

Learning to plan and schedule has received increasing attention due to its efficiency in problem-solving and its potential to outperform hand-crafted heuristics. In particular, actor-critic-based reinforcement learning (RL) has been widely adopted for uncertain environments. Yet one long-standing challenge in applying RL to real-world industrial decision-making problems is the high variance during training. Existing efforts design novel value functions to alleviate the issue but still suffer from it. In this paper, we address the issue from the perspective of adjusting the actor-critic paradigm itself. We start from an observation ignored in many industrial problems: the environmental dynamics an agent faces consist of two physically independent parts, the exogenous task demand over time and the hard constraints on actions. We theoretically show that decoupling these two effects in the actor-critic framework reduces variance. Accordingly, we propose to decouple and model them separately in the state transition of the Markov decision process (MDP). In the demand-encoding process, the temporal task demand (e.g., passenger arrivals in elevator scheduling) is encoded and scored by a critic. In the constraint-encoding process, an actor-critic module is adopted for action embedding, and the two critics are then combined to compute a revised advantage function. Experimental results show that our method adaptively handles different dynamic planning and scheduling tasks and outperforms recent learning-based models and traditional heuristic algorithms.
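The abstract states that two critics, one for the exogenous demand and one for the action constraints, are combined into a revised advantage function, but does not give the exact formula. The sketch below is therefore only an illustrative assumption: it takes the simplest combination, summing the two critics' value estimates into a single baseline for a one-step temporal-difference advantage. All names (`bicritic_advantage`, `v_demand`, `v_constraint`) are hypothetical, not from the paper.

```python
import numpy as np

def bicritic_advantage(rewards, v_demand, v_constraint, gamma=0.99):
    """One-step advantage estimate using two decoupled critics.

    rewards:       array of length T with rewards r_0 .. r_{T-1}.
    v_demand:      array of length T+1 with the demand critic's value
                   estimates for states s_0 .. s_T.
    v_constraint:  array of length T+1 with the constraint critic's
                   value estimates for the same states.

    Assumption (for illustration only): the combined baseline is the
    sum of the two critics, V(s_t) = V_d(s_t) + V_c(s_t), so that
    A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    v = v_demand + v_constraint           # combined state value V(s_t)
    return rewards + gamma * v[1:] - v[:-1]

# Toy rollout of length 2 with gamma = 1 for easy hand-checking.
rewards = np.array([1.0, 0.0])
vd = np.array([0.5, 0.5, 0.0])            # demand critic values
vc = np.array([0.5, 0.5, 0.0])            # constraint critic values
adv = bicritic_advantage(rewards, vd, vc, gamma=1.0)
print(adv)  # [ 1. -1.]
```

Summing the critics keeps each one responsible for its own source of dynamics (exogenous demand vs. action feasibility), which is one plausible reading of how decoupling the two effects could lower the variance of the advantage estimate.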
