A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters

This paper provides the first convergence proof for fuzzy reinforcement learning (FRL), as well as experimental results supporting our analysis. We extend the work of Konda and Tsitsiklis [16], who presented a convergent actor-critic (AC) algorithm for a general parameterized actor. We prove that a fuzzy rulebase actor satisfies the conditions that guarantee convergence of its parameters to a local optimum. Our fuzzy rulebase uses Takagi-Sugeno-Kang rules, Gaussian membership functions, and product inference. As an application domain, we chose the difficult task of power control in wireless transmitters, which is characterized by delayed rewards and a high degree of stochasticity. To the best of our knowledge, no reinforcement learning algorithm has previously been applied to this task. Our simulation results show that the ACFRL algorithm consistently converges in this domain to a locally optimal policy.
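The abstract fixes the actor's form precisely enough to sketch it: Takagi-Sugeno-Kang rules, Gaussian membership functions, and product inference. The minimal Python sketch below shows one way such an actor could look, assuming a zero-order TSK rulebase whose defuzzified output serves as the mean of a Gaussian exploration policy. The class and parameter names (TSKActor, action_noise, critic_signal) are hypothetical, and the update shown is a generic score-function actor step, not the paper's exact ACFRL recursion.

    import numpy as np

    class TSKActor:
        """Hypothetical zero-order TSK fuzzy actor with Gaussian memberships."""

        def __init__(self, centers, widths, consequents, action_noise=0.1):
            self.c = np.asarray(centers, dtype=float)        # (n_rules, n_inputs) MF centers
            self.s = np.asarray(widths, dtype=float)         # (n_rules, n_inputs) MF widths
            self.a = np.asarray(consequents, dtype=float)    # (n_rules,) rule consequents
            self.noise = action_noise                        # std. dev. of exploration noise

        def firing_strengths(self, x):
            # Product inference: multiply per-input Gaussian memberships per rule.
            mu = np.exp(-((x - self.c) ** 2) / (2.0 * self.s ** 2))
            w = mu.prod(axis=1)
            return w / (w.sum() + 1e-12)                     # normalized firing strengths

        def mean_action(self, x):
            # Zero-order TSK defuzzification: weighted average of consequents.
            return float(self.firing_strengths(x) @ self.a)

        def act(self, x, rng):
            # Gaussian exploration around the rulebase output.
            return self.mean_action(x) + self.noise * rng.standard_normal()

        def grad_log_policy(self, x, u):
            # Score d log pi(u|x) / d a_i for a Gaussian policy: the normalized
            # firing strength scaled by the action residual.
            return (u - self.mean_action(x)) / self.noise ** 2 * self.firing_strengths(x)

        def update(self, x, u, critic_signal, lr=1e-3):
            # Generic actor step: move consequents along the score direction
            # weighted by the critic's evaluation of the sampled action.
            self.a += lr * critic_signal * self.grad_log_policy(x, u)

    # Usage (toy one-dimensional state, two rules):
    rng = np.random.default_rng(0)
    actor = TSKActor(centers=[[0.2], [0.8]], widths=[[0.3], [0.3]], consequents=[0.1, 0.9])
    x = np.array([0.5])
    u = actor.act(x, rng)

Under this parameterization the score with respect to each consequent is a smooth, bounded function of the state (the rule's normalized firing strength), which is the kind of regularity the Konda-Tsitsiklis-style convergence conditions place on the actor.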

[1] A. G. Barto, et al. Toward a modern theory of adaptive networks: expectation and prediction, 1981, Psychological Review.

[2] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[3] Michio Sugeno, et al. Fuzzy identification of systems and its applications to modeling and control, 1985, IEEE Transactions on Systems, Man, and Cybernetics.

[4] M. Sugeno, et al. Structure identification of fuzzy model, 1988.

[5] B. Kosko. Fuzzy systems as universal approximators, 1992, Proceedings of the IEEE International Conference on Fuzzy Systems.

[6] Hamid R. Berenji, et al. Learning and tuning fuzzy logic controllers through reinforcements, 1992, IEEE Transactions on Neural Networks.

[7] Gerald Tesauro, et al. Practical Issues in Temporal Difference Learning, 1992, Machine Learning.

[8] Hamid R. Berenji, et al. A reinforcement learning-based architecture for fuzzy logic control, 1992, International Journal of Approximate Reasoning.

[9] L. Wang, et al. Fuzzy systems are universal approximators, 1992, Proceedings of the IEEE International Conference on Fuzzy Systems.

[10] Michael I. Jordan, et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, 1994, NIPS.

[11] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics and Autonomous Systems.

[12] Wei Zhang, et al. A Reinforcement Learning Approach to Job-Shop Scheduling, 1995, IJCAI.

[13] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Athena Scientific.

[14] John N. Tsitsiklis, et al. Analysis of temporal-difference learning with function approximation, 1996, NIPS.

[15] Andrew W. Moore, et al. Gradient Descent for General Reinforcement Learning, 1998, NIPS.

[16] John N. Tsitsiklis, et al. Actor-Critic Algorithms, 1999, NIPS.

[17] David Tse, et al. Power control and capacity of spread spectrum wireless networks, 1999, Automatica.

[18] H. R. Berenji, et al. Cooperation and coordination between fuzzy reinforcement learning agents in continuous state partially observable Markov decision processes, 1999, Proceedings of FUZZ-IEEE'99 (IEEE International Fuzzy Systems Conference).

[19] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[20] Vivek S. Borkar, et al. Actor-Critic-Type Learning Algorithms for Markov Decision Processes, 1999, SIAM Journal on Control and Optimization.

[21] D. Vengerov, et al. An Empirical Model of Factor Adjustment Dynamics, 2006.

[22] Nicholas Bambos, et al. Power controlled multiple access (PCMA) in wireless communication networks, 2000, Proceedings of IEEE INFOCOM 2000.

[23] John N. Tsitsiklis, et al. Call admission control and routing in integrated services networks using neuro-dynamic programming, 2000, IEEE Journal on Selected Areas in Communications.

[24] Peter L. Bartlett, et al. Reinforcement Learning in POMDP's via Direct Gradient Ascent, 2000, ICML.

[25] Ron Sun, et al. From implicit skills to explicit knowledge: a bottom-up model of skill learning, 2001, Cognitive Science.