The Multi-Armed Bandit With Stochastic Plays

We extend the stochastic multi-armed bandit to the case where the number of arms to play evolves as a stationary process. The work is motivated by demand response in power systems, where the number of arms to play (that is, loads to dispatch) depends on a random power imbalance. We give an algorithm based on upper confidence bounds that achieves sublinear pseudo-regret, and we apply the results to several demand response examples.
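
The setting can be illustrated with a minimal sketch. It is not the algorithm from the paper. It assumes a standard UCB1-style index, Bernoulli rewards, and a number of plays drawn independently and uniformly in each round (a simple special case of a stationary process). The names `ucb_indices`, `select_arms`, and `run` are hypothetical and chosen only for this illustration.

```python
import math
import random


def ucb_indices(counts, means, t):
    """UCB1-style index per arm; unplayed arms get +inf so they are tried first."""
    return [
        float("inf") if n == 0 else m + math.sqrt(2.0 * math.log(t) / n)
        for n, m in zip(counts, means)
    ]


def select_arms(indices, k):
    """Return the k arms with the largest indices (ties broken by arm number)."""
    order = sorted(range(len(indices)), key=lambda i: (-indices[i], i))
    return order[:k]


def run(true_means, horizon, seed=0):
    """Simulate the bandit with a random number of plays per round.

    Assumption: the number of plays k is i.i.d. uniform on {1, ..., n_arms},
    and rewards are Bernoulli with the given (unknown to the learner) means.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms      # times each arm has been played
    means = [0.0] * n_arms     # empirical mean reward of each arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        k = rng.randint(1, n_arms)  # stochastic number of arms to play
        chosen = select_arms(ucb_indices(counts, means, t), k)
        for i in chosen:
            reward = 1.0 if rng.random() < true_means[i] else 0.0
            counts[i] += 1
            means[i] += (reward - means[i]) / counts[i]  # running average
            total_reward += reward
    return counts, means, total_reward
```

In a demand response reading of this sketch, each arm would stand for a load and `k` for the number of loads that must be dispatched to cover the imbalance in that round; calling `run([0.9, 0.5, 0.1], 2000)` returns the play counts, empirical means, and cumulative reward.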
