Online Learning Schemes for Power Allocation in Energy Harvesting Communications

We consider the problem of power allocation over one or more time-varying channels with unknown distributions in energy harvesting communications. In the single-channel case, the transmitter chooses its transmit power based on the amount of energy stored in its battery, with the goal of maximizing the average rate over time. We model this problem as a Markov decision process (MDP) with the transmitter as the agent, the battery status as the state, the transmit power as the action, and the rate as the reward. The average-reward maximization problem can be formulated as a linear program (LP) that uses the transition probabilities of the state-action pairs and their reward values to select a power allocation policy. This problem is challenging because channel uncertainty implies that the mean rewards associated with the state-action pairs are unknown. We therefore propose two online learning algorithms, linear program of sample means (LPSM) and Epoch-LPSM, that learn these rewards and adapt their policies over time. For both algorithms, we prove that the regret is upper-bounded by a constant; to our knowledge, this is the first result showing constant-regret learning algorithms for MDPs with unknown mean rewards. We also prove an even stronger result for LPSM: its policy matches the optimal policy exactly in finite expected time. Epoch-LPSM incurs higher regret than LPSM but substantially reduces the computational requirements. We further consider a multi-channel scenario, where the agent also chooses a channel in each slot, and present our multi-channel LPSM (MC-LPSM) algorithm, which explores different channels and uses that information to solve the LP during exploitation. MC-LPSM incurs a regret that scales logarithmically in time and linearly in the number of channels. By proving a matching lower bound on the regret of any algorithm, we also establish the asymptotic order optimality of MC-LPSM.
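To make the LP step concrete, the following is a minimal sketch of the standard average-reward occupation-measure LP that an LPSM-style algorithm would re-solve with sample-mean reward estimates. It is written against CVXPY; the state/action dimensions, the transition tensor P, and the reward estimates r_hat are illustrative placeholders under assumed values, not quantities or code from the paper.

```python
# Sketch: average-reward LP over state-action occupation measures,
# with rewards replaced by sample means (the LPSM idea).
import numpy as np
import cvxpy as cp

S, A = 4, 3                      # assumed: battery levels (states), power levels (actions)
rng = np.random.default_rng(0)

# P[s, a, s'] : known battery-transition probabilities for each state-action pair
P = rng.dirichlet(np.ones(S), size=(S, A))

# r_hat[s, a] : sample-mean rate observed so far for each state-action pair;
# an LPSM-style learner refines these estimates and re-solves the LP over time
r_hat = rng.uniform(0.0, 1.0, size=(S, A))

# x[s, a] : stationary occupation measure over state-action pairs
x = cp.Variable((S, A), nonneg=True)

constraints = [cp.sum(x) == 1]
# Stationarity: probability flow out of each state equals the flow into it under P
for s in range(S):
    inflow = cp.sum(cp.multiply(x, P[:, :, s]))
    constraints.append(cp.sum(x[s, :]) == inflow)

problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(x, r_hat))), constraints)
problem.solve()

# Recover a policy: in each state, play the action carrying the most occupation
# mass (arbitrary for states the optimal stationary policy never visits)
policy = x.value.argmax(axis=1)
print("estimated power-level index per battery state:", policy)
```

The occupation measure x induces the transmit-power policy, and as the sample means r_hat converge to the true mean rates, the LP solution converges to the optimal power allocation.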
