Natural Gradient Policy for Average Cost SMDP Problem

Semi-Markov decision processes (SMDPs) are continuous-time generalizations of discrete-time Markov decision processes. A number of value iteration and policy iteration algorithms have been developed for solving SMDP problems. However, solving an SMDP problem with these methods requires prior knowledge of the deterministic kernels and suffers from the curse of dimensionality. In this paper, we present a steepest descent direction based on a family of parameterized policies to overcome these limitations. The update rule is based on stochastic policy gradients and employs Amari's natural gradient approach, which moves toward choosing a greedy optimal action. We then show considerable performance improvements with this method on a simple two-state SMDP problem and on a more complex SMDP for call admission control.
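As a rough illustration of the kind of update the abstract describes, the sketch below performs one natural gradient descent step for a tabular softmax (Gibbs) policy. It is a minimal sketch under stated assumptions, not the paper's algorithm: the softmax parameterization, the function names, the ridge term, and the per-decision `cost_signal` (a stand-in for an estimate of the differential cost) are all hypothetical, and the sojourn-time weighting that an average cost SMDP estimator would need is not modeled.

```python
import numpy as np

def softmax_policy(theta, state):
    """Action probabilities in one state under a tabular softmax policy."""
    prefs = theta[state] - np.max(theta[state])  # shift for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def score(theta, state, action):
    """Gradient of log pi(action | state) w.r.t. theta (same shape as theta)."""
    grad = np.zeros_like(theta)
    grad[state] = -softmax_policy(theta, state)
    grad[state, action] += 1.0
    return grad

def natural_gradient_step(theta, samples, step_size=0.1, ridge=1e-3):
    """One step theta <- theta - step_size * F^{-1} g.

    samples: list of (state, action, cost_signal) tuples, where cost_signal
    is a hypothetical per-decision cost estimate (an assumption of this sketch).
    """
    dim = theta.size
    fisher = ridge * np.eye(dim)  # ridge keeps the Fisher estimate invertible
    grad = np.zeros(dim)
    for state, action, cost_signal in samples:
        psi = score(theta, state, action).ravel()
        fisher += np.outer(psi, psi) / len(samples)  # empirical Fisher matrix
        grad += psi * cost_signal / len(samples)     # policy gradient estimate
    direction = np.linalg.solve(fisher, grad)        # natural gradient direction
    return theta - step_size * direction.reshape(theta.shape)

# Two states, two actions; costly actions should become less likely.
theta0 = np.zeros((2, 2))
samples = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]
theta1 = natural_gradient_step(theta0, samples)
```

Premultiplying the gradient estimate by the inverse Fisher matrix is what distinguishes this step from an ordinary gradient step; the small ridge term is an implementation convenience added here because the softmax Fisher matrix is singular.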
