Learning Augmented Index Policy for Optimal Service Placement at the Network Edge

We consider the problem of service placement at the network edge, in which a decision maker must choose which of N services to host at the edge to satisfy the demands of end users. Our goal is to design adaptive algorithms that minimize the average service delivery latency for users. We pose the problem as a Markov decision process (MDP) whose state records, for each service, the number of users currently waiting at the edge to obtain that service. Solving this N-service MDP is computationally expensive due to the curse of dimensionality. To overcome this challenge, we show that the optimal policy for a single-service MDP has an appealing threshold structure and, using the theory of Whittle index policies, explicitly derive the Whittle index of each service as a function of the number of requests from end users. Since request arrival and service delivery rates are usually unknown and possibly time-varying, we then develop efficient learning-augmented algorithms that fully exploit the structure of the optimal policies and incur low learning regret. The first of these, UCB-Whittle, relies on the principle of optimism in the face of uncertainty. The second, Q-learning-Whittle, runs Q-learning iterations for each service using two-timescale stochastic approximation. We characterize the non-asymptotic performance of UCB-Whittle by analyzing its learning regret, and we analyze the convergence properties of Q-learning-Whittle. Simulation results show that the proposed policies yield excellent empirical performance.
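The index-policy idea described above can be illustrated with a minimal sketch. It assumes the edge can host a fixed number of services at a time (the hypothetical parameter `capacity`) and that each service comes with an index function of its current number of waiting requests. The index functions used here are placeholders chosen for illustration; they are not the closed-form Whittle indices derived in the paper, and `index_policy` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def index_policy(
    queue_lengths: Sequence[int],
    index_fns: Sequence[Callable[[int], float]],
    capacity: int,
) -> List[int]:
    """Return the ids of the services to host.

    Each service i is scored by index_fns[i](queue_lengths[i]); the
    `capacity` services with the largest scores are hosted.
    """
    scores = [fn(q) for fn, q in zip(index_fns, queue_lengths)]
    # Rank service ids by score, largest first (ties keep the lower id first).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:capacity])


# Placeholder index functions (an assumption, not the paper's formulas):
# a per-service weight times the number of waiting requests.
placeholder_indices = [lambda q, w=w: w * q for w in (1.0, 2.0, 0.5)]

# Three services with 3, 1, and 8 waiting requests; room for two services.
hosted = index_policy([3, 1, 8], placeholder_indices, capacity=2)
```

The point of the sketch is the decoupling: each service is scored from its own state alone, so the per-step decision costs one index evaluation per service plus a sort, rather than a search over the joint N-service state space.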
