Stochastic dilemmas: foundations and applications

Abstract of the dissertation "Stochastic Dilemmas: Foundations and Applications" by Sergiu Goschin. Dissertation directors: Michael L. Littman and Haym Hirsh.

One of the significant challenges in solving optimization problems is coping with inaccurate or inconsistent function evaluations. Surprisingly, this problem is far from trivial even in one of the most basic possible settings: deciding which of two options is better when the values of the two options are random variables (a stochastic dilemma). Problems in this space have long been studied in the statistics, operations research, and computer science communities under the name "multi-armed bandits". While most previous work has focused on dealing with noise in an online setting, this dissertation focuses on offline optimization, where the goal is to return a reasonable solution with high probability using a finite number of samples. I discuss a set of problem settings of increasing complexity that allow one to derive formal algorithmic bounds, and I point to and discuss interesting connections between stochastic optimization and noisy data.
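The basic stochastic dilemma described above, choosing the better of two noisy options with high probability from a finite number of samples, can be sketched with a standard Hoeffding-style sample-size bound. This is a minimal illustration of the general idea, not the dissertation's algorithm; the function names and parameters (`better_option`, `epsilon`, `delta`) are assumptions for the sketch, and rewards are assumed to lie in [0, 1].

```python
import math
import random

def better_option(sample_a, sample_b, epsilon=0.1, delta=0.1):
    """Pick the option with the higher mean, assuming rewards in [0, 1].

    If the true means differ by at least epsilon, the returned option is
    the better one with probability at least 1 - delta. The sample size n
    comes from Hoeffding's inequality: each empirical mean is within
    epsilon/2 of its true mean with probability at least 1 - delta/2 when
    n >= (2 / epsilon**2) * ln(4 / delta).
    """
    n = math.ceil((2.0 / epsilon ** 2) * math.log(4.0 / delta))
    mean_a = sum(sample_a() for _ in range(n)) / n
    mean_b = sum(sample_b() for _ in range(n)) / n
    return "A" if mean_a >= mean_b else "B"

# Example: two Bernoulli options with means 0.9 and 0.1.
random.seed(0)
choice = better_option(lambda: float(random.random() < 0.9),
                       lambda: float(random.random() < 0.1))
```

Note that the number of samples depends only on the gap `epsilon` and the failure probability `delta`, not on the distributions themselves; sharpening this dependence in richer settings is exactly where the offline-optimization analysis becomes nontrivial.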
