Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical Report

Recently, we have struck a balance between the information freshness, in terms of age of information (AoI), experienced by users and the energy consumed by sensors, by appropriately activating sensors to update their current status in caching-enabled Internet of Things (IoT) networks [1]. To solve this problem, we cast the corresponding status update procedure as a continuing Markov decision process (MDP), i.e., one without terminal states, where the number of state-action pairs grows exponentially with the number of sensors and users considered. To circumvent the resulting curse of dimensionality, we established a methodology for designing deep reinforcement learning (DRL) algorithms to maximize the average reward (resp. minimize the average cost), by integrating R-learning, a tabular reinforcement learning (RL) algorithm tailored to maximizing the long-term average reward, with traditional DRL algorithms, which were originally developed to optimize the discounted long-term cumulative reward rather than the average one. In this technical report, we present detailed discussions of the technical contributions of this methodology.

Index Terms—Continuing MDP, deep reinforcement learning, long-term average reward, discounted long-term cumulative reward.
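To make the contrast concrete: R-learning replaces the discount factor of standard Q-learning with a running estimate ρ of the long-term average reward, updating relative action values against that estimate. The following is a minimal tabular sketch; the two-state `ring_step` environment, learning rates, and step count are illustrative assumptions for exposition, not the IoT status update model of the report.

```python
import random

def r_learning(step, n_states, n_actions,
               n_steps=5000, alpha=0.1, beta=0.01,
               epsilon=0.1, seed=0):
    """Tabular R-learning (Schwartz, 1993): learns relative action
    values Q(s, a) together with an estimate rho of the long-term
    average reward, instead of discounting future rewards."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    rho = 0.0  # running estimate of the average reward
    s = 0
    for _ in range(n_steps):
        # epsilon-greedy action selection
        greedy_a = max(range(n_actions), key=lambda a: Q[s][a])
        a = rng.randrange(n_actions) if rng.random() < epsilon else greedy_a
        r, s_next = step(s, a)
        # average-adjusted temporal-difference error
        delta = r - rho + max(Q[s_next]) - Q[s][a]
        Q[s][a] += alpha * delta
        # rho is updated only on greedy (non-exploratory) transitions
        if a == greedy_a:
            rho += beta * (r - rho + max(Q[s_next]) - max(Q[s]))
        s = s_next
    return Q, rho

# Illustrative toy continuing MDP: two states on a ring; action 1 moves
# to the other state and earns reward 1, action 0 stays and earns 0,
# so the optimal long-term average reward is 1.
def ring_step(s, a):
    return (1.0, 1 - s) if a == 1 else (0.0, s)

Q, rho = r_learning(ring_step, n_states=2, n_actions=2)
```

On this toy MDP, rho converges toward the optimal average reward of 1, and the learned Q values prefer the moving action in both states. The deep variant discussed in the report replaces the table Q with a neural network while retaining the same average-adjusted TD error.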

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[2] T. Q. S. Quek et al., "Optimal Status Update for Caching Enabled IoT Networks: A Dueling Deep R-Network Approach," IEEE Transactions on Wireless Communications, 2021.

[3] V. C. M. Leung et al., "Software-Defined Networks with Mobile Edge Computing and Caching for Smart Cities: A Big Data Deep Reinforcement Learning Approach," IEEE Communications Magazine, 2017.

[4] Z. Han et al., "Data Freshness and Energy-Efficient UAV Navigation Optimization: A Deep Reinforcement Learning Approach," IEEE Transactions on Intelligent Transportation Systems, 2020.

[5] Y.-C. Liang et al., "Intelligent Sharing for LTE and WiFi Systems in Unlicensed Bands: A Deep Reinforcement Learning Approach," IEEE Transactions on Communications, 2020.

[6] M. Guizani et al., "Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications," IEEE Communications Surveys & Tutorials, 2015.

[7] A. Schwartz, "A Reinforcement Learning Method for Maximizing Undiscounted Rewards," in Proc. ICML, 1993.

[8] G. Y. Li et al., "Learn to Compress CSI and Allocate Resources in Vehicular Networks," IEEE Transactions on Communications, 2019.

[9] M. Ma et al., "A Deep Reinforcement Learning Approach for Dynamic Contents Caching in HetNets," in Proc. IEEE International Conference on Communications (ICC), 2020.

[10] H. S. Dhillon et al., "A Reinforcement Learning Framework for Optimizing Age of Information in RF-Powered Communication Systems," IEEE Transactions on Communications, 2019.

[11] R. S. Sutton et al., "Discounted Reinforcement Learning is Not an Optimization Problem," arXiv preprint, 2019.

[12] E. Feinberg et al., "Examples Concerning Abelian and Cesaro Limits," arXiv:1310.2482, 2013.

[13] R. S. Sutton et al., Reinforcement Learning: An Introduction, 1998.

[14] L. Song et al., "A novel caching mechanism for Internet of Things (IoT) sensing service with energy harvesting," in Proc. IEEE International Conference on Communications (ICC), 2016.