Hierarchical Knowledge Gradient for Sequential Sampling

We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting one as best when exploration stops. Each alternative may be characterized by a multi-dimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required. It does, however, require some prior knowledge of the smoothness of the function, supplied in the form of an aggregation function, and computational issues limit the number of alternatives that can easily be considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies.
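To make the myopic rule in the abstract concrete, here is a minimal sketch of the standard knowledge-gradient computation for independent normal beliefs with known measurement noise. It is not the hierarchical policy proposed in the paper, which additionally combines estimates across aggregation levels. The function names (`kg_factors`, `choose_alternative`, `update_belief`) and the list-based data layout are assumptions made purely for illustration.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kg_factors(mu, var, noise_var):
    """Knowledge-gradient value of one more measurement of each alternative.

    mu[x]        : current posterior mean of alternative x
    var[x]       : current posterior variance of alternative x
    noise_var[x] : known variance of the measurement noise (assumed > 0)
    Requires at least two alternatives.
    """
    n = len(mu)
    factors = []
    for x in range(n):
        # Standard deviation of the change in the posterior mean of x
        # caused by a single additional measurement of x.
        sigma_tilde = var[x] / math.sqrt(var[x] + noise_var[x])
        if sigma_tilde == 0.0:
            factors.append(0.0)  # nothing left to learn about x
            continue
        best_other = max(mu[i] for i in range(n) if i != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde
        factors.append(sigma_tilde * (zeta * normal_cdf(zeta) + normal_pdf(zeta)))
    return factors


def choose_alternative(mu, var, noise_var):
    """Myopic rule: measure the alternative with the largest KG factor."""
    factors = kg_factors(mu, var, noise_var)
    return max(range(len(factors)), key=factors.__getitem__)


def update_belief(mu_x, var_x, noise_var_x, y):
    """Conjugate normal update of one alternative after observing y."""
    precision = 1.0 / var_x + 1.0 / noise_var_x
    new_mu = (mu_x / var_x + y / noise_var_x) / precision
    return new_mu, 1.0 / precision
```

In this independent-beliefs form, a measurement of one alternative updates only that alternative's belief. The hierarchical aggregation described in the abstract is what allows a single measurement to inform many alternatives at once.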

Warren B. Powell | Peter I. Frazier | Martijn R. K. Mes
