We consider an opportunistic spectrum access (OSA) problem where the time-varying condition of each channel (e.g., as a result of random fading or certain primary users' activities) is modeled as an arbitrary finite-state Markov chain. At each time instant, a (secondary) user probes a channel and collects a reward that is a function of the channel's state (e.g., a good channel condition yields a higher data rate for the user). Each channel has a potentially different state space and statistics, both unknown to the user, who tries to learn which channel is the best as it goes and to maximize its use of that channel. The objective is to construct an online learning algorithm that minimizes the regret, i.e., the difference between the user's total reward and the reward it would have obtained by always using the (on average) best channel had the channel statistics been known a priori. This is a classic exploration-versus-exploitation problem, for which results abound when the reward processes are assumed to be i.i.d. Compared to prior work, the key difference is that in our case each reward process is Markovian, of which i.i.d. is a special case. In addition, the reward processes are restless in that the channel conditions continue to evolve independently of the user's actions. This leads to a restless bandit problem, for which, to the best of our knowledge, few results exist on either algorithms or performance bounds in this learning context. In this paper we introduce an algorithm that utilizes regenerative cycles of a Markov chain and computes a sample-mean-based index policy, and we show that under mild conditions on the state transition probabilities of the Markov chains this algorithm achieves logarithmic regret uniformly over time, and that this regret bound is also optimal. We numerically examine the performance of this algorithm along with a few other learning algorithms in the case of an OSA problem with Gilbert-Elliot channel models, and discuss how this algorithm may be further improved (in terms of its constant) and how this result may lead to similar bounds for other algorithms.
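The abstract does not spell out the algorithm's details, but the following is a minimal Python sketch of the general idea it describes: a sample-mean index with an exploration bonus, computed only from samples observed inside completed regenerative cycles (successive returns to a designated state), applied to simulated two-state Gilbert-Elliot channels that keep evolving whether or not they are probed. The channel parameters, the exploration constant L_EXPLORE, and the choice of the "good" state as the regenerative state are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

# Illustrative sketch only: a regenerative-cycle, sample-mean index policy
# on restless two-state Gilbert-Elliot channels. Parameter values are assumed.

L_EXPLORE = 2.0          # exploration constant in the index (assumed value)
GOOD, BAD = 1, 0         # channel states; the reward equals the state here
REGEN_STATE = GOOD       # regenerative state used to delimit cycles

class GilbertElliotChannel:
    def __init__(self, p_gb, p_bg):
        self.p_gb = p_gb              # P(good -> bad)
        self.p_bg = p_bg              # P(bad -> good)
        self.state = random.choice((GOOD, BAD))

    def step(self):
        if self.state == GOOD:
            self.state = BAD if random.random() < self.p_gb else GOOD
        else:
            self.state = GOOD if random.random() < self.p_bg else BAD
        return self.state

def index(total_reward, n_samples, t):
    """Sample mean over completed cycles plus a UCB-style exploration bonus."""
    if n_samples == 0:
        return float("inf")
    return total_reward / n_samples + math.sqrt(L_EXPLORE * math.log(t) / n_samples)

def run(channels, horizon):
    k = len(channels)
    reward_sum = [0.0] * k   # rewards collected inside completed cycles
    samples = [0] * k        # number of such samples per channel
    counted_t = 1            # total time counted toward the index
    collected = 0.0
    t = 0
    while t < horizon:
        arm = max(range(k), key=lambda i: index(reward_sum[i], samples[i], counted_t))
        # Phase 1: play the chosen channel until it hits the regenerative state;
        # these rewards are collected but do not feed the index.
        while t < horizon and channels[arm].state != REGEN_STATE:
            collected += channels[arm].state
            for c in channels:        # every channel evolves: the problem is restless
                c.step()
            t += 1
        # Phase 2: play one full regenerative cycle; these samples feed the index.
        cycle_reward, cycle_len, first = 0.0, 0, True
        while t < horizon and (first or channels[arm].state != REGEN_STATE):
            first = False
            cycle_reward += channels[arm].state
            cycle_len += 1
            for c in channels:
                c.step()
            t += 1
        reward_sum[arm] += cycle_reward
        samples[arm] += cycle_len
        counted_t += cycle_len
        collected += cycle_reward
    return collected

if __name__ == "__main__":
    random.seed(0)
    chans = [GilbertElliotChannel(0.1, 0.2), GilbertElliotChannel(0.3, 0.1)]
    print("total reward over 10000 slots:", run(chans, 10_000))
```

The reason for delimiting samples by regenerative cycles is that successive cycles of a Markov chain form i.i.d. blocks, which is what lets sample-mean index arguments developed for i.i.d. rewards be carried over to restless Markovian rewards.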