Opportunistic Spectrum Access: Online Search of Optimality

This paper presents an online tuning approach for the ad-hoc reinforcement learning algorithms used to solve the exploitation-exploration dilemma of opportunistic spectrum access in dynamic environments. These algorithms originate from a well-known problem in computer science, the multi-armed bandit (MAB) problem, and have been shown to be viable solutions for the detection and exploration of white spaces in opportunistic spectrum access. Previous work (A. Ben Hadj Alaya-Feki et al., 2008) has shown that reinforcement learning solutions of the MAB problem are very sensitive to the statistical properties of the wireless medium access and therefore need careful tuning as the wireless environment varies. This paper addresses the online tuning of those algorithms by proposing and assessing two approaches: (1) a meta-learning approach, in which a second learner (the meta learner) learns the parameters of the base learner, and (2) the Exp3 algorithm, which has previously been proposed for dynamic tuning of MAB parameters in other contexts. Simulation results obtained on an IEEE 802.11 medium access scenario show that one of the proposed meta-learning methods, namely the change point detection method, achieves much better performance than the other methods.
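For readers unfamiliar with Exp3, the following is a minimal sketch of the textbook exponential-weight update, not the configuration evaluated in the paper. Each arm may stand for a channel or for a candidate parameter value of a base learner; here the arms are simple Bernoulli sources. The class name `Exp3`, the helper `simulate`, the success probabilities, and the choice `gamma = 0.1` are all illustrative assumptions.

```python
import math
import random


class Exp3:
    """Textbook Exp3 for rewards in [0, 1] (illustrative sketch)."""

    def __init__(self, n_arms, gamma, rng=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        self.n_arms = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms
        self.rng = rng or random.Random()

    def probabilities(self):
        # Mix the normalized weights with a uniform exploration term.
        total = sum(self.weights)
        return [
            (1.0 - self.gamma) * w / total + self.gamma / self.n_arms
            for w in self.weights
        ]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Importance-weighted estimate: only the played arm is updated.
        p = self.probabilities()[arm]
        estimate = reward / p
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale so the largest weight is 1; this leaves the
        # probabilities unchanged and avoids floating-point overflow.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def simulate(success_probs, rounds, gamma, seed):
    """Run Exp3 on hypothetical Bernoulli arms; return final probabilities."""
    rng = random.Random(seed)
    agent = Exp3(len(success_probs), gamma, rng)
    for _ in range(rounds):
        arm = agent.select()
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        agent.update(arm, reward)
    return agent.probabilities()


if __name__ == "__main__":
    # Assumed toy setting: three arms with different success rates.
    print(simulate([0.2, 0.5, 0.8], rounds=20000, gamma=0.1, seed=1))
```

The uniform mixing term keeps every arm's selection probability at or above `gamma / n_arms`, so Exp3 never stops exploring, which is what makes it a candidate for environments whose statistics drift over time.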

[1]  Nicolò Cesa-Bianchi, et al.  Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[2]  E. Moulines, et al.  Dynamic spectrum access with non-stationary Multi-Armed Bandit, 2008, 2008 IEEE 9th Workshop on Signal Processing Advances in Wireless Communications.

[3]  E. S. Page.  Continuous Inspection Schemes, 1954.

[4]  Kenji Doya, et al.  Meta-learning in Reinforcement Learning, 2003, Neural Networks.

[5]  Richard S. Sutton, et al.  Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[6]  Lang Tong, et al.  A Measurement-Based Model for Dynamic Spectrum Access in WLAN Channels, 2006, MILCOM 2006 - 2006 IEEE Military Communications conference.

[7]  Michèle Sebag, et al.  Change Point Detection and Meta-Bandits for Online Learning in Dynamic Environments, 2007.

[8]  Riyaz T. Sikora.  Learning Optimal Parameter Values in Dynamic Environment: An Experiment with Softmax Reinforcement Learning Algorithm, 2006.

[9]  H. Robbins.  Some aspects of the sequential design of experiments, 1952.

[10]  Michèle Basseville, et al.  Detecting changes in signals and systems - A survey, 1988, Autom.

[11]  Michèle Sebag, et al.  Multi-armed Bandit, Dynamic Environments and Meta-Bandits, 2006.

[12]  A. S. Xanthopoulos, et al.  Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems, 2008, Appl. Math. Comput.