Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure

We introduce efficient algorithms which achieve nearly optimal regrets for the problem of stochastic online shortest path routing with end-to-end feedback. The setting is a natural application of the combinatorial stochastic bandits problem, a special case of the linear stochastic bandits problem. We show how the difficulties posed by the large scale action set can be overcome by the networked structure of the action set. Our approach presents a novel connection between bandit learning and shortest path algorithms. Our main contribution is an adaptive exploration algorithm with nearly optimal instance-dependent regret for any directed acyclic network. We then modify it so that nearly optimal worst case regret is achieved simultaneously. Driven by the carefully designed Top-Two Comparison (TTC) technique, the algorithms are efficiently implementable. We further conduct extensive numerical experiments to show that our proposed algorithms not only achieve superior regret performances, but also reduce the runtime drastically.

[1]  Peter Auer,et al.  UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung..

[2]  Qing Zhao,et al.  Adaptive shortest-path routing under unknown and stochastically varying link states , 2012, 2012 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt).

[3]  Bhaskar Krishnamachari,et al.  Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations , 2010, IEEE/ACM Transactions on Networking.

[4]  Benjamin Van Roy,et al.  Learning to Optimize via Posterior Sampling , 2013, Math. Oper. Res..

[5]  Tamir Hazan,et al.  Tight Bounds for Bandit Combinatorial Optimization , 2017, COLT.

[6]  Gábor Lugosi,et al.  Minimax Policies for Combinatorial Prediction Games , 2011, COLT.

[7]  Alessandro Lazaric,et al.  Linear Thompson Sampling Revisited , 2016, International Conference on Artificial Intelligence and Statistics.

[8]  Yasin Abbasi-Yadkori Forced-Exploration Based Algorithms for Playing in Stochastic Linear Bandits , 2009 .

[9]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[10]  Kin K. Leung,et al.  Identifiability of link metrics based on end-to-end path measurements , 2013, Internet Measurement Conference.

[11]  Gábor Lugosi,et al.  Regret in Online Combinatorial Optimization , 2012, Math. Oper. Res..

[12]  Elad Hazan,et al.  Volumetric Spanners: An Efficient Exploration Basis for Learning , 2013, J. Mach. Learn. Res..

[13]  Baruch Awerbuch,et al.  Online linear optimization and adaptive routing , 2008, J. Comput. Syst. Sci..

[14]  Wei Chen,et al.  Combinatorial multi-armed bandit: general framework, results and applications , 2013, ICML 2013.

[15]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[16]  Aleksandrs Slivkins,et al.  Introduction to Multi-Armed Bandits , 2019, Found. Trends Mach. Learn..

[17]  Richard Bellman,et al.  ON A ROUTING PROBLEM , 1958 .

[18]  T. L. Lai Andherbertrobbins Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[19]  J. Bather,et al.  Multi‐Armed Bandit Allocation Indices , 1990 .

[20]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[21]  Elad Hazan,et al.  Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization , 2008, COLT.

[22]  Alessandro Lazaric,et al.  Linear Thompson Sampling Revisited , 2016, AISTATS.

[23]  Csaba Szepesvári,et al.  Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[24]  H. Robbins Some aspects of the sequential design of experiments , 1952 .

[25]  John N. Tsitsiklis,et al.  Linearly Parameterized Bandits , 2008, Math. Oper. Res..

[26]  Wei Chu,et al.  Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[27]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[28]  Sébastien Bubeck,et al.  The entropic barrier: a simple and optimal universal self-concordant barrier , 2014, COLT.

[29]  Jianqing Fan,et al.  High-Dimensional Statistics , 2014 .

[30]  Zheng Wen,et al.  Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits , 2014, AISTATS.

[31]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[32]  Shipra Agrawal,et al.  Thompson Sampling for Contextual Bandits with Linear Payoffs , 2012, ICML.

[33]  Richard Combes,et al.  Stochastic Online Shortest Path Routing: The Value of Feedback , 2013, IEEE Transactions on Automatic Control.