Residual Overfit Method of Exploration

Exploration is a crucial aspect of bandit and reinforcement learning algorithms. The uncertainty quantification needed for exploration typically comes either from closed-form expressions based on simple models or from computationally intensive resampling and posterior approximations. We propose instead an approximate exploration methodology based on fitting only two point estimates: one tuned and one overfit. The approach, which we term the residual overfit method of exploration (ROME), drives exploration toward actions where the overfit model overfits most relative to the tuned model. The intuition is that overfitting is greatest at actions and contexts with insufficient data to form accurate predictions of the reward. We justify this intuition formally from both a frequentist and a Bayesian information-theoretic perspective. The result is a method that generalizes to a wide variety of models and avoids the computational overhead of resampling or posterior approximations. We compare ROME against a set of established contextual bandit methods on three datasets and find it to be one of the best performing.
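
The abstract reduces the method to a simple recipe: fit a tuned (regularized or early-stopped) reward model and a deliberately overfit one on the same logged data, then steer exploration toward (context, action) pairs where the two disagree most. The sketch below illustrates that idea; the scikit-learn MLPRegressor models, the UCB-style score mu + c * residual, and the constant c are illustrative assumptions, not the paper's exact models or acquisition rule.

```python
# Minimal sketch of the residual-overfit idea described in the abstract.
# Assumptions (not from the abstract): scikit-learn MLPRegressor as the reward
# model, a UCB-style score that adds the residual as an exploration bonus, and
# the scaling constant c. The paper's exact models and rule may differ.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_models(X, rewards):
    """Fit a tuned (regularized, early-stopped) model and an overfit model on the same data."""
    tuned = MLPRegressor(hidden_layer_sizes=(64,), alpha=1e-2,
                         early_stopping=True, max_iter=500, random_state=0)
    overfit = MLPRegressor(hidden_layer_sizes=(64,), alpha=0.0,
                           early_stopping=False, max_iter=5000, random_state=0)
    tuned.fit(X, rewards)
    overfit.fit(X, rewards)
    return tuned, overfit

def choose_action(tuned, overfit, context, action_features, c=1.0):
    """Score each candidate action by the tuned prediction plus a residual-overfit bonus."""
    # One (context, action) feature row per candidate action.
    X_cand = np.hstack([np.tile(context, (len(action_features), 1)), action_features])
    mu = tuned.predict(X_cand)                       # exploitation: tuned reward estimate
    residual = np.abs(overfit.predict(X_cand) - mu)  # exploration: where the overfit
                                                     # model disagrees with the tuned one
    return int(np.argmax(mu + c * residual))
```

In this sketch the residual plays the role that a posterior standard deviation or bootstrap spread plays in other exploration schemes, but it requires only two model fits rather than resampling or posterior approximation.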
