Simulation-based Uniform Value Function Estimates of Markov Decision Processes

The value function of a Markov decision process (MDP) assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge uniformly to the expected reward over a class of policies, in terms of the Vapnik-Chervonenkis (VC) dimension or P-dimension of the policy class. Further, we show through a counterexample that whether uniform convergence holds for a given MDP can depend on the simulation method used. We also obtain uniform convergence results for the average-reward case and for partially observable MDPs (POMDPs), and the results extend readily to Markov games. They can be viewed both as a contribution to empirical process theory and as an extension of probably approximately correct (PAC) learning theory to POMDPs and Markov games.
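The estimator at the heart of the abstract — running a fixed policy many times and averaging the discounted reward — can be sketched in a few lines. The following is a minimal Python sketch, not the paper's own simulation model: the toy transition tensor `P`, reward matrix `r`, policy, and helper names (`simulate_discounted_reward`, `estimate_value`) are all illustrative assumptions.

```python
import numpy as np

def simulate_discounted_reward(P, r, policy, gamma, horizon, rng):
    """One independent simulation run: follow `policy` from a fixed start
    state and accumulate discounted reward over `horizon` steps. Truncating
    at a finite horizon costs at most gamma**horizon / (1 - gamma) times the
    reward bound, so the bias can be made negligible."""
    s = 0                              # fixed initial state
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]                  # deterministic stationary policy
        total += discount * r[s, a]
        s = rng.choice(len(P[s, a]), p=P[s, a])   # sample the next state
        discount *= gamma
    return total

def estimate_value(P, r, policy, gamma, n_runs=10_000, horizon=200, seed=0):
    """Empirical average of the discounted reward over n_runs runs."""
    rng = np.random.default_rng(seed)
    runs = [simulate_discounted_reward(P, r, policy, gamma, horizon, rng)
            for _ in range(n_runs)]
    return float(np.mean(runs))

# Toy 2-state, 2-action MDP; all numbers are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a] = distribution over next states
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                  # r[s, a] = bounded one-step reward
              [0.0, 2.0]])
policy = np.array([0, 1])                  # action taken in each state
print(estimate_value(P, r, policy, gamma=0.95))
```

For a single fixed policy, a standard Hoeffding argument already shows that O((1/ε²) log(1/δ)) runs suffice for an ε-accurate estimate with probability 1 − δ; the paper's contribution is making such sample-size bounds hold uniformly over an entire policy class, with the class's VC or P-dimension replacing the union bound.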
