PAC optimal MDP planning with application to invasive species management

In a simulator-defined MDP, the Markovian dynamics and rewards are provided in the form of a simulator from which samples can be drawn. This paper studies MDP planning algorithms that attempt to minimize the number of simulator calls before terminating and outputting a policy that is approximately optimal with high probability. The paper introduces two heuristics for efficient exploration and an improved confidence interval that enables earlier termination with probabilistic guarantees. We prove that the heuristics and the confidence interval are sound: with high probability, the resulting algorithms terminate in polynomial time and return an approximately optimal policy. Experiments on two benchmark problems and two instances of an invasive species management problem show that the improved confidence interval and the new search heuristics reduce the number of simulator calls required to reach near-optimal policies by 8% to 47%.
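To make the setting concrete, the sketch below illustrates the generic sample-then-check loop that PAC planning from a simulator implies: maintain empirical transition counts, run interval value iteration with a confidence radius around each empirical distribution, and terminate once the upper and lower bounds on Q* agree to within epsilon everywhere, at which point the greedy policy for the lower bound is near-optimal with high probability. This is a minimal illustration, not the paper's algorithm: the toy MDP and all names are assumptions, the radius is a standard L1 (Weissman-et-al.-style) bound rather than the paper's improved interval, and exploration is a naive least-visited rule rather than the paper's heuristics.

```python
import numpy as np

# Toy simulator-defined MDP: the true dynamics P_true are hidden behind
# simulator(); the planner may only draw samples from it. All sizes,
# constants, and the exploration rule here are illustrative assumptions.
S, A = 5, 2                     # numbers of states and actions
GAMMA = 0.9                     # discount factor
EPSILON = 1.0                   # target accuracy of the returned policy
DELTA = 0.05                    # allowed failure probability
VMAX = 1.0 / (1.0 - GAMMA)      # value bound for rewards in [0, 1]

rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # hidden true transitions
R = rng.uniform(size=(S, A))                      # rewards, assumed known

def simulator(s, a):
    """One simulator call: draw s' ~ P(. | s, a)."""
    return rng.choice(S, p=P_true[s, a])

def interval_value_iteration(counts, n, delta, iters=300):
    """Upper/lower bounds on Q* from the empirical model plus an L1
    confidence radius per (s, a) (Weissman et al.-style; the paper
    derives a tighter interval than this)."""
    P_hat = counts / np.maximum(n, 1)[:, :, None]
    rad = np.sqrt(2.0 * (S * np.log(2.0) + np.log(1.0 / delta))
                  / np.maximum(n, 1))
    V_up, V_lo = np.full(S, VMAX), np.zeros(S)
    for _ in range(iters):
        # Optimistic/pessimistic Bellman backups: a distribution inside
        # the L1 ball can shift at most rad/2 probability mass between
        # the extreme next-state values.
        Q_up = R + GAMMA * (P_hat @ V_up
                            + 0.5 * rad * (V_up.max() - V_up.min()))
        Q_lo = R + GAMMA * (P_hat @ V_lo
                            - 0.5 * rad * (V_lo.max() - V_lo.min()))
        V_up = np.minimum(Q_up, VMAX).max(axis=1)
        V_lo = np.maximum(Q_lo, 0.0).max(axis=1)
    return V_up, V_lo, np.maximum(Q_lo, 0.0)

counts, n = np.zeros((S, A, S)), np.zeros((S, A))
calls, BATCH = 0, 500

for _ in range(2000):                               # safety cap on rounds
    # Union bound: split delta across all S*A confidence intervals.
    V_up, V_lo, Q_lo = interval_value_iteration(counts, n, DELTA / (S * A))
    if (V_up - V_lo).max() <= EPSILON:              # bounds tight: stop
        break
    s, a = np.unravel_index(np.argmin(n), n.shape)  # crude exploration:
    for _ in range(BATCH):                          # sample least-visited pair
        counts[s, a, simulator(s, a)] += 1
        n[s, a] += 1
        calls += 1

policy = Q_lo.argmax(axis=1)    # greedy w.r.t. the lower bound
print(f"simulator calls: {calls}, policy: {policy}")
```

Everything this loop spends its sampling budget on is visible in two places: which (s, a) pair receives the next simulator call, and how fast the radius rad shrinks. Those are exactly the two knobs the paper tightens, via its exploration heuristics and its improved confidence interval, which is where the reported 8% to 47% savings in simulator calls come from.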
