Scaling Model-Based Average-Reward Reinforcement Learning for Product Delivery

Reinforcement learning in real-world domains suffers from three curses of dimensionality: explosions in the state space and in the action space, and high stochasticity. We present approaches that mitigate each of these curses. To handle the state-space explosion, we introduce “tabular linear functions” that generalize tile-coding and linear value functions. To reduce action-space complexity, we replace exhaustive search of the joint action space with a form of hill climbing. To deal with high stochasticity, we introduce a new algorithm called ASH-learning, an afterstate version of H-Learning. Together, these extensions make it practical to apply reinforcement learning to product delivery, an optimization problem that combines inventory control and vehicle routing.
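The idea of replacing exhaustive joint-action search with hill climbing can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hill_climb_joint_action`, the coordinate-wise neighborhood (change one agent's action at a time), and the toy value function are all assumptions made for illustration only.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def hill_climb_joint_action(
    choices: Sequence[Sequence[int]],
    value: Callable[[Tuple[int, ...]], float],
    start: Optional[Sequence[int]] = None,
) -> Tuple[Tuple[int, ...], float]:
    """Coordinate-wise hill climbing over a joint action (hypothetical helper).

    choices[i] lists the actions available to agent i; value scores a
    complete joint action. Returns a local optimum and its score.
    """
    # Start from the given joint action, or from each agent's first option.
    joint: List[int] = list(start) if start is not None else [c[0] for c in choices]
    best = value(tuple(joint))
    improved = True
    while improved:
        improved = False
        # Try changing one agent's action at a time, keeping the others fixed.
        for agent, options in enumerate(choices):
            for option in options:
                if option == joint[agent]:
                    continue
                candidate = joint.copy()
                candidate[agent] = option
                score = value(tuple(candidate))
                # Accept only strict improvements, so the loop must terminate.
                if score > best:
                    joint, best = candidate, score
                    improved = True
    return tuple(joint), best


# Toy example (assumed, not from the paper): three agents, five actions each.
TARGET = (3, 1, 4)


def toy_value(joint: Tuple[int, ...]) -> int:
    # Separable score, so the local optimum found is also the global one.
    return -sum((a - t) ** 2 for a, t in zip(joint, TARGET))


best_joint, best_score = hill_climb_joint_action([range(5)] * 3, toy_value)
```

Each sweep of this sketch evaluates a number of candidates that grows with the sum of the per-agent action counts, whereas exhaustive search grows with their product. The trade-off is that hill climbing may stop at a local optimum when the value function is not separable across agents.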
