Reinforcement Learning for Resource Allocation in LEO Satellite Networks

In this paper, we develop and assess online decision-making algorithms for call admission and routing in low Earth orbit (LEO) satellite networks. Recent work has shown that, in a LEO satellite system, a semi-Markov decision process (SMDP) formulation of the call admission and routing problem can outperform existing routing methods in terms of an average revenue function. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed to circumvent the computational burden of DP. The first is an actor-critic method with temporal-difference (TD) learning; the second is a critic-only method, called optimistic TD learning. The proposed algorithms reduce the storage, computational-complexity, and computation-time requirements while optimizing an overall long-term average revenue function that penalizes blocked calls. Numerical studies show that the RL framework can achieve up to 56% higher average revenue than existing routing methods used in LEO satellite networks, with reasonable storage and computational requirements.
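
As a concrete illustration (not taken from the paper), the sketch below shows an average-reward actor-critic TD update of the kind the abstract describes, applied to a toy admission/routing decision with a linear critic and a softmax actor. All names, feature sizes, step sizes, and rewards are illustrative assumptions; the paper's SMDP formulation additionally accounts for the random sojourn time between decision epochs, which this unit-step sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8       # assumed length of the state feature vector
N_ACTIONS = 3        # illustrative: reject, admit on route 1, admit on route 2
ALPHA_CRITIC = 0.01  # critic step size
ALPHA_ACTOR = 0.001  # actor step size (slower, two-timescale style)
ALPHA_RHO = 0.005    # step size for the average-revenue estimate

w = np.zeros(N_FEATURES)                    # linear critic weights
theta = np.zeros((N_ACTIONS, N_FEATURES))  # softmax actor weights
rho = 0.0                                   # running average-revenue estimate

def policy(phi):
    """Softmax distribution over admission/routing actions."""
    prefs = theta @ phi
    prefs -= prefs.max()  # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def update(phi, action, reward, phi_next):
    """One average-reward actor-critic TD update on a single transition."""
    global w, theta, rho
    # TD error for the average-reward criterion:
    # delta = r - rho + V(s') - V(s), with V(s) linear in the features.
    delta = reward - rho + w @ phi_next - w @ phi
    rho += ALPHA_RHO * delta
    w += ALPHA_CRITIC * delta * phi
    # Score function of the softmax policy: grad of log pi(action | phi).
    p = policy(phi)
    grad = -np.outer(p, phi)
    grad[action] += phi
    theta += ALPHA_ACTOR * delta * grad

# Toy driver standing in for the satellite-network simulator.
for _ in range(10_000):
    phi = rng.random(N_FEATURES)               # current state features
    a = rng.choice(N_ACTIONS, p=policy(phi))   # admission/routing decision
    reward = rng.random() if a > 0 else -1.0   # toy revenue; rejection penalized
    phi_next = rng.random(N_FEATURES)
    update(phi, a, reward, phi_next)

print(f"estimated long-run average revenue: {rho:.3f}")
```

Roughly speaking, the critic-only variant mentioned in the abstract would keep only the value estimate and derive the admission/routing decision greedily from it, rather than maintaining separate actor parameters as above.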
