Leader-Follower semi-Markov Decision Problems: Theoretical Framework and Approximate Solution

Leader-follower problems are hierarchical decision problems in which a leader uses incentives to induce desired behavior among a set of self-interested followers. Dynamic leader-follower problems extend this structure to multi-period decision situations. In this work we propose a Markov decision process (MDP) framework for a class of dynamic leader-follower problems with important applications, and we discuss their approximate solution using reinforcement learning (RL). In these problems, the leader makes incentive decisions intermittently, while the followers make their decisions in every period. Our theoretical framework and computational approach rest on the observation that such dynamic problems can be viewed as two coupled sequential decision processes: that of the leader and that of the followers. In our formulation, the leader's decision problem, which has the structure of a single-agent semi-Markov decision process (SMDP), and the followers' sequential decision problem, which is structured as a stochastic game (a multiagent competitive MDP), operate over the same state space. We call this framework a leader-follower semi-Markov decision process (LFSMDP). We consider the approximate solution of these problems using RL and demonstrate the approach in the special case where the followers' stochastic game is a repeated game.
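The coupling of the two decision processes can be illustrated with a toy simulation. The sketch below is not the algorithm of this work; it is a minimal illustration under stated assumptions: two followers, a single shared state (the repeated-game special case), a made-up stage game, myopic stateless Q-learning for the followers, and a discounted SMDP-style Q-update for the leader. All names, payoffs, and parameter values (`simulate`, `follower_payoff`, `leader_payoff`, the incentive levels) are hypothetical.

```python
import random


def simulate(num_epochs=200, periods_per_epoch=5, seed=0):
    """Toy leader-follower loop: the leader acts once per epoch,
    the followers act in every period of that epoch."""
    rng = random.Random(seed)
    incentives = [0.0, 1.0]   # hypothetical leader actions (bonus levels)
    actions = [0, 1]          # follower actions: 0 = defect, 1 = comply
    gamma, alpha, eps = 0.95, 0.1, 0.1

    leader_q = {b: 0.0 for b in incentives}
    follower_q = [
        {(b, a): 0.0 for b in incentives for a in actions} for _ in range(2)
    ]

    def follower_payoff(a_own, a_other, bonus):
        # Assumed stage game: complying costs 0.5 and earns the bonus.
        return (bonus - 0.5) * a_own + 0.2 * a_other

    def leader_payoff(a1, a2, bonus):
        # Assumed leader reward: gains 2 per complier, pays out the bonus.
        return (2.0 - bonus) * (a1 + a2)

    def eps_greedy(choices, score):
        if rng.random() < eps:
            return rng.choice(choices)
        return max(choices, key=score)

    for _ in range(num_epochs):
        # Leader decision epoch (intermittent).
        bonus = eps_greedy(incentives, lambda b: leader_q[b])
        discounted_reward, discount = 0.0, 1.0

        for _ in range(periods_per_epoch):
            # Followers play one stage of the repeated game.
            a = [
                eps_greedy(actions, lambda x, q=q: q[(bonus, x)])
                for q in follower_q
            ]
            for i, q in enumerate(follower_q):
                r = follower_payoff(a[i], a[1 - i], bonus)
                q[(bonus, a[i])] += alpha * (r - q[(bonus, a[i])])
            discounted_reward += discount * leader_payoff(a[0], a[1], bonus)
            discount *= gamma

        # SMDP-style update: reward accumulated over the sojourn, and the
        # continuation value discounted by gamma ** periods_per_epoch.
        target = discounted_reward + discount * max(leader_q.values())
        leader_q[bonus] += alpha * (target - leader_q[bonus])

    return leader_q, follower_q


if __name__ == "__main__":
    lq, fq = simulate()
    print("leader Q-values:", lq)
    print("follower Q-values:", fq)
```

The inner loop runs every period, while the leader's value estimate is updated only at the end of each sojourn. This mirrors the two time scales described above: the followers' game proceeds period by period, and the leader's problem is semi-Markov because several periods elapse between its decisions.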
