Math Programming based Reinforcement Learning for Multi-Echelon Inventory Management

Reinforcement Learning (RL) has led to considerable breakthroughs in diverse areas such as robotics and games. However, its application to complex real-world decision-making problems remains limited. Many problems in Operations Management (inventory and revenue management, for example) are characterized by large action spaces and stochastic system dynamics. These characteristics make such problems considerably harder for existing RL methods, which rely on enumeration to solve the per-step action problem. To resolve these issues, we develop Programmable Actor Reinforcement Learning (PARL), a policy iteration method that uses techniques from integer programming and sample average approximation. Analytically, we show that, for a given critic, the learned policy in each iteration converges to the optimal policy as the number of samples of the underlying uncertainty goes to infinity. Practically, we show that a properly selected discretization of the distribution of the underlying uncertainty can yield a near-optimal actor policy even with very few samples. We then apply our algorithm to real-world inventory management problems with complex supply chain structures and show that PARL outperforms state-of-the-art RL and inventory optimization methods in these settings. We find that PARL outperforms the commonly used base-stock heuristic by 51.3% and RL-based methods by up to 9.58% on average across different supply chain environments.
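The abstract describes the actor step only at a high level: for a fixed critic V, choose the action that maximizes the expected one-step reward plus discounted critic value, with the expectation replaced by a sample average over a discretized uncertainty distribution. The sketch below illustrates that idea under strong simplifying assumptions that are not taken from the paper: a single inventory node with zero lead time and lost sales, Normal demand discretized into k equally weighted midpoint quantiles, hypothetical price and cost parameters, and a hypothetical linear critic. All function names (`discretize_demand`, `step`, `saa_actor`) are illustrative. Because the action set here is tiny, candidate order quantities are simply enumerated; PARL instead solves this maximization with integer programming, which is what allows it to handle large action spaces.

```python
from statistics import NormalDist


def discretize_demand(mean, std, k):
    """Approximate Normal(mean, std) demand by k equally weighted quantile points.

    Assumption: midpoint quantiles (i + 0.5) / k, clipped at zero demand.
    """
    dist = NormalDist(mean, std)
    points = [max(0.0, dist.inv_cdf((i + 0.5) / k)) for i in range(k)]
    weights = [1.0 / k] * k
    return points, weights


def step(x, q, d, price=5.0, cost=2.0, hold=0.5):
    """One period at a single node: zero lead time, unmet demand is lost.

    Hypothetical economics: revenue per unit sold, purchase cost per unit
    ordered, holding cost per unit left over at the end of the period.
    """
    available = x + q
    sales = min(available, d)
    leftover = available - sales
    reward = price * sales - cost * q - hold * leftover
    return reward, leftover


def saa_actor(x, critic, points, weights, actions, gamma=0.99):
    """Pick the action maximizing the sample average of r + gamma * V(next state).

    Enumerates `actions` for illustration only; PARL solves this step with
    integer programming rather than enumeration.
    """
    best_q, best_val = None, float("-inf")
    for q in actions:
        val = 0.0
        for d, w in zip(points, weights):
            r, nxt = step(x, q, d)
            val += w * (r + gamma * critic(nxt))
        if val > best_val:  # ties keep the first (smallest) action
            best_q, best_val = q, val
    return best_q, best_val


def linear_critic(s):
    """Hypothetical critic that mildly penalizes carrying inventory."""
    return -0.1 * s


if __name__ == "__main__":
    points, weights = discretize_demand(mean=10.0, std=2.0, k=5)
    q, val = saa_actor(0.0, linear_critic, points, weights, range(0, 21))
    print(f"order {q} units, estimated value {val:.2f}")
```

Increasing `k` makes the sample average a closer approximation of the true expectation, at the cost of more terms in the per-step optimization; the abstract's claim is that a well-chosen discretization already yields near-optimal actions with very few points.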
