A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem

Bandit learning is characterized by the tension between long-term exploration and short-term exploitation. However, as has recently been noted, in settings in which the choices of the learning algorithm correspond to important decisions about individual people (such as criminal recidivism prediction, lending, and sequential drug trials), exploration corresponds to explicitly sacrificing the well-being of one individual for the potential future benefit of others. This raises a fairness concern. In such settings, one might like to run a "greedy" algorithm, which always makes the (myopically) optimal decision for the individuals at hand - but doing this can result in a catastrophic failure to learn. In this paper, we consider the linear contextual bandit problem and revisit the performance of the greedy algorithm. We give a smoothed analysis, showing that even when contexts may be chosen by an adversary, small perturbations of the adversary's choices suffice for the algorithm to achieve "no regret", perhaps (depending on the specifics of the setting) with a constant amount of initial training data. This suggests that "generically" (i.e. in slightly perturbed environments), exploration and exploitation need not be in conflict in the linear setting.
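To make the setting concrete, below is a minimal sketch of the greedy algorithm in a smoothed linear contextual bandit environment. The per-arm ridge-regression estimates, Gaussian context perturbation, and all dimensions and noise levels are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

# Sketch: greedy (no-exploration) play in a linear contextual bandit where an
# adversary's contexts are slightly perturbed (the "smoothed" environment).
# All constants below are assumed for illustration only.

rng = np.random.default_rng(0)
d, K, T = 5, 3, 2000           # context dimension, number of arms, horizon (assumed)
sigma_reward = 0.1             # reward noise level (assumed)
sigma_perturb = 0.1            # std of the perturbation applied to adversarial contexts (assumed)

theta = rng.normal(size=(K, d))        # unknown true arm parameters
A = [np.eye(d) for _ in range(K)]      # per-arm ridge Gram matrices
b = [np.zeros(d) for _ in range(K)]    # per-arm response vectors

regret = 0.0
for t in range(T):
    # Adversary chooses contexts; nature adds a small Gaussian perturbation.
    adversarial = rng.uniform(-1, 1, size=(K, d))
    contexts = adversarial + sigma_perturb * rng.normal(size=(K, d))

    # Greedy choice: play the arm with the highest *estimated* reward,
    # with no explicit exploration bonus.
    theta_hat = [np.linalg.solve(A[i], b[i]) for i in range(K)]
    estimates = [contexts[i] @ theta_hat[i] for i in range(K)]
    i_t = int(np.argmax(estimates))

    # Observe a noisy linear reward and update only the chosen arm's estimate.
    r = contexts[i_t] @ theta[i_t] + sigma_reward * rng.normal()
    A[i_t] += np.outer(contexts[i_t], contexts[i_t])
    b[i_t] += r * contexts[i_t]

    regret += max(contexts[i] @ theta[i] for i in range(K)) - contexts[i_t] @ theta[i_t]

print(f"average regret after {T} rounds: {regret / T:.4f}")
```

In this sketch, the perturbation noise plays the role that explicit exploration would otherwise play: it spreads the observed contexts enough that each arm's least-squares estimate keeps improving even though the algorithm never deliberately explores.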
