Sample Efficient Policy Search for Optimal Stopping Domains

Optimal stopping problems ask when to stop an observation-generating process in order to maximize a return. We examine the problem of simultaneously learning and planning in such domains when data is collected directly from the environment. We propose GFSE, a simple and flexible model-free policy search method that reuses data for sample efficiency by leveraging problem structure. We bound the sample complexity of our approach to guarantee uniform convergence of policy value estimates, tightening existing PAC bounds to achieve logarithmic dependence on horizon length for our setting. We also compare our method against prevalent model-based and model-free approaches on three domains taken from diverse fields.
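To make the idea of data reuse concrete, the following is a minimal sketch and not the GFSE algorithm itself, whose details are not given in this abstract. It assumes that observations evolve independently of the stopping decision, so that a trajectory recorded without stopping early can be replayed to evaluate any stopping rule. The Gaussian random-walk process, the threshold policy class, and all function names are hypothetical choices made for illustration.

```python
import random


def gather_full_trajectories(n, horizon, seed=0):
    """Simulate n full-length observation sequences (never stopping early).

    Assumption: the process is a Gaussian random walk, chosen only for illustration.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n):
        x, path = 0.0, []
        for _ in range(horizon):
            x += rng.gauss(0.0, 1.0)
            path.append(x)
        trajectories.append(path)
    return trajectories


def stopped_return(path, threshold):
    """Return obtained by stopping at the first observation >= threshold.

    If the threshold is never reached, the rule stops at the final step.
    """
    for x in path:
        if x >= threshold:
            return x
    return path[-1]


def evaluate(trajectories, threshold):
    """Average return of one threshold rule, replayed on every stored trajectory."""
    total = sum(stopped_return(path, threshold) for path in trajectories)
    return total / len(trajectories)


def search(trajectories, thresholds):
    """Pick the candidate threshold with the highest estimated value.

    Every candidate is scored on the same data, so no new samples are needed.
    """
    return max(thresholds, key=lambda th: evaluate(trajectories, th))


if __name__ == "__main__":
    data = gather_full_trajectories(n=500, horizon=20)
    candidates = [0.5 * k for k in range(0, 13)]
    best = search(data, candidates)
    print("best threshold:", best, "estimated value:", evaluate(data, best))
```

Because every candidate rule is scored on the same stored trajectories, the number of environment samples does not grow with the number of policies considered. This holds only under the independence assumption stated above.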
