Due to the growing demand for wireless spectrum and the inefficiency with which it is currently utilized, cognitive radio (CR) technology addresses the issue by allowing unlicensed users to make use of frequency bands in which the licensed users are not currently active. Under this hierarchical structure, CR users may only grab resources under the premise of not interfering with the normal operation of the primary system (PS), and this extra constraint further complicates the already time-varying wireless communication problem.

By assuming a Bernoulli distribution for each channel and independence across channels, this dynamic spectrum access scheme can be treated as a multi-armed bandit (MAB) problem for a single CR user: each channel is regarded as a slot machine with some expected reward, and the user tries to obtain as much available bandwidth as possible. The key component of the MAB problem is the tradeoff between exploitation and exploration, where the CR terminal tries to pick the channel with the highest reward estimated from past history while, at the same time, probing other channels that might yield even higher rewards.

There are different versions of the MAB formulation. For stationary distributions, the Gittins index is shown in [6] to be the optimal strategy for the discounted MAB, and [8] applies it to CR. When channel distributions are allowed to change over time, Whittle's index is proved in [12] to be asymptotically optimal under some constraints, and it is shown in [9] that opportunistic spectrum access is indexable, so this strategy applies. However, both approaches assume an infinite horizon and maximize a discounted reward, whereas in the wireless environment we only care about the reward obtained within a finite observation period, which leads us to the finite-time MAB introduced in [3] and subsequent work. To our knowledge no optimal strategy is known for this setting, so we instead consider several finite-time algorithms with tuned parameters.

In this paper we largely follow the algorithms in [1] and [11], and proceed as follows. Section 2 describes the network model in detail, and Section 3 examines some common finite-time MAB algorithms. Section 4 provides numerical simulations comparing the algorithms under different probability distributions, followed by conclusions and possible extensions in Section 5.
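To make the exploitation/exploration tradeoff concrete, the following minimal sketch applies the UCB1 policy of Auer et al. [5] to the single-user channel selection problem described above, with each channel modeled as an independent Bernoulli arm. The channel probabilities, horizon, and function name are illustrative assumptions, not part of the network model defined in Section 2.

    import math
    import random

    def ucb1_channel_selection(channel_probs, horizon, seed=0):
        """Minimal UCB1 sketch for single-user channel selection.

        Each channel is a Bernoulli arm: a sensed channel is idle
        (reward 1) with the given probability, busy (reward 0)
        otherwise. All parameters here are illustrative assumptions.
        """
        rng = random.Random(seed)
        n = len(channel_probs)
        counts = [0] * n      # times each channel was sensed
        means = [0.0] * n     # empirical idle probability per channel
        total_reward = 0

        # Initialization: sense every channel once.
        for ch in range(n):
            reward = 1 if rng.random() < channel_probs[ch] else 0
            counts[ch], means[ch] = 1, float(reward)
            total_reward += reward

        for t in range(n, horizon):
            # Exploitation plus exploration: pick the channel maximizing
            # the empirical mean plus a bonus that shrinks with sensing count.
            ch = max(range(n),
                     key=lambda i: means[i]
                     + math.sqrt(2 * math.log(t + 1) / counts[i]))
            reward = 1 if rng.random() < channel_probs[ch] else 0
            counts[ch] += 1
            means[ch] += (reward - means[ch]) / counts[ch]  # running average
            total_reward += reward

        return total_reward, counts

    # Example: three channels whose idle probabilities are unknown to the user.
    reward, counts = ucb1_channel_selection([0.2, 0.5, 0.8], horizon=10000)
    print("collected reward:", reward, "| sensing counts:", counts)

The exploration bonus ensures every channel keeps being sensed occasionally, while channels with high empirical idle probability are sensed most often; tuned variants of this bonus are among the finite-time algorithms examined in Section 3.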
[1] Ananthram Swami et al., "Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework," IEEE Journal on Selected Areas in Communications, 2007.
[2] Qing Zhao et al., "A Restless Bandit Formulation of Opportunistic Access: Indexability and Index Policy," 2008 5th IEEE Annual Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks Workshops, 2008.
[3] Ahmed Sultan et al., "Blind Cognitive MAC Protocols," 2009 IEEE International Conference on Communications, 2009.
[4] E. Moulines et al., "Dynamic spectrum access with non-stationary Multi-Armed Bandit," 2008 IEEE 9th Workshop on Signal Processing Advances in Wireless Communications, 2008.
[5] Peter Auer et al., "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning, 2002.
[6] Nicolò Cesa-Bianchi et al., "Finite-Time Regret Bounds for the Multiarmed Bandit Problem," ICML, 1998.
[7] Aurélien Garivier et al., "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems," arXiv:0805.3415, 2008.
[8] Mehryar Mohri et al., "Multi-armed Bandit Algorithms and Empirical Evaluation," ECML, 2005.
[9] Csaba Szepesvári et al., "Tuning Bandit Algorithms in Stochastic Environments," ALT, 2007.
[10] P. Whittle, "Restless bandits: activity allocation in a changing world," Journal of Applied Probability, 1988.