DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Ofir Nachum | Yinlam Chow | Bo Dai | Lihong Li
[1] Eduardo F. Morales, et al. An Introduction to Reinforcement Learning, 2011.
[2] J. M. Robins, et al. Marginal Mean Models for Dynamic Regimes, 2001, Journal of the American Statistical Association.
[3] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[4] Louis Wehenkel, et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories, 2013, Ann. Oper. Res.
[5] Martha White, et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.
[6] Takafumi Kanamori, et al. A Least-squares Approach to Direct Importance Estimation, 2009, J. Mach. Learn. Res.
[7] M. Kawanabe, et al. Direct importance estimation for covariate shift adaptation, 2008.
[8] Steffen Bickel, et al. Discriminative learning for differing training and test distributions, 2007, ICML '07.
[9] Richard S. Sutton, et al. Weighted importance sampling for off-policy learning with linear function approximation, 2014, NIPS.
[10] R. Tyrrell Rockafellar. Convex Analysis, 1970, Princeton Landmarks in Mathematics and Physics.
[11] Gábor Lugosi, et al. Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.
[12] Shie Mannor, et al. Consistent On-Line Off-Policy Evaluation, 2017, ICML.
[13] Sanjoy Dasgupta, et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.
[14] Philip S. Thomas, et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.
[15] Alborz Geramifard, et al. Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping, 2008, UAI.
[16] Csaba Szepesvári, et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, 2006, Machine Learning.
[17] Sergey Levine, et al. Offline policy evaluation across representations with applications to educational games, 2014, AAMAS.
[18] Le Song, et al. Smoothed Dual Embedding Control, 2017, arXiv.
[19] Alessandro Lazaric, et al. Finite-sample analysis of least-squares policy iteration, 2012, J. Mach. Learn. Res.
[20] Lihong Li, et al. Neural Approaches to Conversational AI, 2019, Found. Trends Inf. Retr.
[21] Emma Brunskill, et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.
[22] Marc G. Bellemare, et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift, 2019, AAAI.
[23] Le Song, et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation, 2017, ICML.
[24] Takafumi Kanamori, et al. Density Ratio Estimation in Machine Learning, 2012.
[25] Lihong Li, et al. Toward Minimax Off-policy Value Estimation, 2015, AISTATS.
[26] Martin J. Wainwright, et al. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization, 2008, IEEE Transactions on Information Theory.
[27] Mehrdad Farajtabar, et al. More Robust Doubly Robust Off-policy Evaluation, 2018, ICML.
[28] Miroslav Dudík, et al. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits, 2016, ICML.
[29] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, arXiv.
[30] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
[31] Karsten M. Borgwardt, et al. Covariate Shift by Kernel Mean Matching, 2009, NIPS.
[32] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.
[33] W. K. Hastings. Monte Carlo Sampling Methods Using Markov Chains and Their Applications, 1970.
[34] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.
[35] David Haussler, et al. Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension, 1995, J. Comb. Theory, Ser. A.
[36] D. Pollard. Convergence of stochastic processes, 1984.
[37] C. Chu, et al. Semiparametric density estimation under a two-sample density ratio model, 2004.
[38] Sean R. Eddy. What is dynamic programming?, 2004, Nature Biotechnology.
[39] Wei Chu, et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, 2010, WSDM '11.
[40] Alexander Shapiro, et al. Lectures on Stochastic Programming: Modeling and Theory, 2009.
[41] Y. Qin. Inferences for case-control and semiparametric two-sample density ratio models, 1998.
[42] Bin Yu. Rates of Convergence for Empirical Processes of Stationary Mixing Sequences, 1994.
[43] Jianfeng Gao, et al. Deep Reinforcement Learning for Dialogue Generation, 2016, EMNLP.
[44] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.
[45] Le Song, et al. Learning from Conditional Distributions via Dual Embeddings, 2016, AISTATS.
[46] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.
[47] Jakub W. Pachocki, et al. Learning dexterous in-hand manipulation, 2018, Int. J. Robotics Res.
[48] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[49] John Langford, et al. Off-policy evaluation for slate recommendation, 2016, NIPS.
[50] Lihong Li, et al. Stochastic Variance Reduction Methods for Policy Evaluation, 2017, ICML.