Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making

The Markov assumption (MA) is fundamental to the empirical validity of reinforcement learning. In this paper, we propose a novel Forward-Backward Learning procedure to test MA in sequential decision making. The proposed test does not assume any parametric form for the joint distribution of the observed data and plays an important role in identifying the optimal policy in high-order Markov decision processes (MDPs) and partially observable MDPs. We apply our test to both synthetic datasets and a real data example from mobile health studies to illustrate its usefulness.
