Learning and Selecting the Right Customers for Reliability: A Multi-Armed Bandit Approach

In this paper, we consider residential demand response (DR) programs in which an aggregator calls upon residential customers to change their demand so that the total load adjustment is as close to a target value as possible. The major challenges lie in the uncertainty and randomness of customer behavior in response to DR signals, and in the aggregator's limited knowledge of its customers. To learn and select the right customers, we formulate the DR problem as a combinatorial multi-armed bandit (CMAB) problem with a reliability goal. We propose a learning algorithm, CUCB-Avg (Combinatorial Upper Confidence Bound-Average), which uses both upper confidence bounds and sample averages to balance the tradeoff between exploration (learning) and exploitation (selecting). We prove that CUCB-Avg achieves $O(\log T)$ regret given a time-invariant target. Simulation results demonstrate that CUCB-Avg performs significantly better than the classic CUCB (Combinatorial Upper Confidence Bound) algorithm.
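The abstract does not spell out CUCB-Avg's exact selection rule, so the following is a minimal sketch of one plausible reading: rank customers by their UCB indices (exploration), then greedily add customers until the sum of their sample-average responses covers the target (exploitation). The function name `cucb_avg`, the confidence-bonus constant, the Gaussian noise model, and the deviation-based regret proxy are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def cucb_avg(mu_true, target, T, seed=None):
    """Illustrative sketch of a CUCB-Avg-style selection rule (assumes target > 0).

    mu_true : array of each customer's (unknown) mean load reduction.
    target  : desired total load adjustment per round.
    Returns the per-round absolute deviation |sum of true means - target|,
    a simple proxy for the reliability regret studied in the paper.
    """
    rng = np.random.default_rng(seed)
    mu_true = np.asarray(mu_true, dtype=float)
    n = len(mu_true)
    counts = np.zeros(n)   # times each customer has been selected
    means = np.zeros(n)    # sample-average response of each customer
    deviations = []
    for t in range(1, T + 1):
        # UCB index; never-selected customers get +inf to force exploration.
        with np.errstate(divide="ignore", invalid="ignore"):
            bonus = np.sqrt(3.0 * np.log(t) / (2.0 * counts))
        ucb = np.where(counts > 0, means + bonus, np.inf)
        # Greedy fill: scan customers in descending-UCB order, stopping once
        # the running sum of *sample averages* reaches the target.
        chosen, running_sum = [], 0.0
        for i in np.argsort(-ucb):
            if running_sum >= target:
                break
            chosen.append(i)
            running_sum += means[i]
        chosen = np.array(chosen)
        # Semi-bandit feedback: observe each selected customer's noisy response.
        obs = mu_true[chosen] + 0.1 * rng.standard_normal(len(chosen))
        counts[chosen] += 1.0
        means[chosen] += (obs - means[chosen]) / counts[chosen]
        deviations.append(abs(mu_true[chosen].sum() - target))
    return deviations
```

For example, with 20 hypothetical customers whose mean reductions are drawn uniformly from [0.5, 2.0] kW and a 10 kW target, the realized deviation should shrink as the sample averages converge:

```python
mu = np.random.default_rng(0).uniform(0.5, 2.0, size=20)  # hypothetical kW values
dev = cucb_avg(mu, target=10.0, T=2000, seed=1)
print(f"mean |deviation|, first 100 rounds: {np.mean(dev[:100]):.3f}")
print(f"mean |deviation|, last 100 rounds:  {np.mean(dev[-100:]):.3f}")
```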
