Reinforcement Learning for Physical Layer Communications

Wireless communication systems must be designed to cope with channel conditions that vary in time, frequency and space, and with a variety of interference sources. In cellular wireless systems, for instance, the channel is estimated regularly by mobile terminals and base stations (BS) using dedicated pilot signals. This allows the transmitters and receivers to be adapted to the current channel conditions and interference scenario. Powerful adaptive signal processing algorithms have been developed over the past decades to cope with the dynamic nature of the wireless channel, e.g. the least mean squares and recursive least squares algorithms for channel equalization or estimation, and Kalman filtering for tracking multiple-input multiple-output channel matrices and frequency offsets. These techniques rely on well-established mathematical models of physical phenomena, which make it possible to derive the optimal processing for a given criterion, e.g. the mean square error, under assumed noise and interference distribution models. Any mathematical model trades off complexity against tractability. A very complete, and hence complex, model may be useless if insight into the state of the system cannot easily be drawn from it. For instance, the wireless propagation channel is in principle deterministic, and the signal received at any point in space at any time can be predicted precisely from Maxwell's equations. However, a receiver would need a prohibitive amount of computation and memory to calculate the electric and magnetic fields at every point, using detailed and explicit knowledge of the physical characteristics of the scatterers in the propagation environment, e.g. the dielectric constants (permittivity) of the walls and other obstacles. It is much more efficient to design receivers that perform well in environments that have been characterized stochastically than to use an explicit deterministic model of each particular propagation environment.
Modern and emerging wireless systems are characterized by massive numbers of connected mobile devices, BSs, sensors and actuators. Modeling such large-scale wireless systems has become a formidable task because of, for example, very small cell sizes, channel-aware link adaptation and waveform deployment, diversity techniques, and the optimization of the different degrees of freedom in transceivers. Consequently, it may not be feasible to build explicit and detailed mathematical models of wireless systems and their operational environments. In fact, there is a serious modeling deficit that calls for creating awareness of the operational wireless environment through sensing and learning. Machine learning (ML) refers to a large class of algorithms that aim to give a machine the capability to acquire knowledge or behavior. If the machine is a wireless system, which is man-made, then
