Fast Online Learning of Antijamming and Jamming Strategies

A Competing Cognitive Radio Network (CCRN) coalesces communicator (comm) nodes and jammers to achieve maximal networking efficiency in the presence of adversarial threats. We have previously developed two contrasting approaches, one based on the multi-armed bandit (MAB) framework and the other on Q-learning. Despite their differences, both approaches have demonstrated the efficacy of applying machine learning to jointly compute comm and jammer actions in a hypothetical two-network competition for open dynamic spectrum. When the sampled channel reward characteristics are time-invariant (i.e., the learned information is stationary), both the MAB- and Q-learning-based strategies have empirically achieved the best possible reward.
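
To make the two approaches concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a comm node might pick a transmission channel with a UCB1-style MAB index and, alternatively, refine a tabular Q-function over channels; the names and parameters (ucb1_select, q_update, N_CHANNELS, ALPHA, GAMMA) are hypothetical.

import math
import random

N_CHANNELS = 8           # hypothetical number of contested spectrum channels
ALPHA, GAMMA = 0.1, 0.9  # hypothetical Q-learning step size and discount factor

# --- MAB strategy: UCB1 index policy over channels (Auer et al., 2002) ---
counts = [0] * N_CHANNELS    # how many times each channel has been played
sums = [0.0] * N_CHANNELS    # cumulative observed reward per channel

def ucb1_select(t):
    """Return the channel with the highest UCB1 index at round t."""
    for ch in range(N_CHANNELS):          # play each channel once first
        if counts[ch] == 0:
            return ch
    return max(range(N_CHANNELS),
               key=lambda ch: sums[ch] / counts[ch]
                              + math.sqrt(2.0 * math.log(t) / counts[ch]))

def mab_update(ch, reward):
    """Record the reward observed after transmitting on channel ch."""
    counts[ch] += 1
    sums[ch] += reward

# --- Q-learning strategy: one-step tabular backup, state = current channel ---
Q = [[0.0] * N_CHANNELS for _ in range(N_CHANNELS)]   # Q[state][action]

def q_update(state, action, reward, next_state):
    """Standard Q-learning update toward reward plus discounted best next value."""
    best_next = max(Q[next_state])
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# Usage sketch: 1000 rounds against a stationary Bernoulli reward per channel,
# the stationary regime in which both strategies converge empirically.
true_probs = [random.random() for _ in range(N_CHANNELS)]
for t in range(1, 1001):
    ch = ucb1_select(t)
    mab_update(ch, 1.0 if random.random() < true_probs[ch] else 0.0)

Under stationary rewards, the UCB1 indices concentrate on the empirically best channel, mirroring the best-possible-reward result summarized above; a jammer-side policy would use the same machinery with rewards defined by denial of the opponent's throughput.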
