Decision-Making Under Selective Labels: Optimal Finite-Domain Policies and Beyond

Selective labels are a common feature of high-stakes decision-making applications: outcomes are observed only under one of the possible decisions (for example, repayment is observed only if a loan is granted). This paper studies the learning of decision policies in the face of selective labels, in an online setting that balances the cost of learning against future utility. In the homogeneous case, in which individuals' features are disregarded, the optimal decision policy is shown to be a threshold policy; the threshold becomes more stringent as more labels are collected, and the rate at which this tightening occurs is characterized. When features are drawn from a finite domain, the optimal policy consists of multiple homogeneous policies running in parallel, one per feature value. For the general infinite-domain case, the homogeneous policy is extended by using a probabilistic classifier and bootstrapping to supply its inputs. In experiments on synthetic and real data, the proposed policies achieve consistently superior utility, with no parameter tuning in the finite-domain case and lower parameter sensitivity in the general case.
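
As a rough illustration of the structure described above, the sketch below implements a Beta-Bernoulli version of a homogeneous threshold policy and the finite-domain extension that runs one such policy per feature value. The class names, the Beta(1, 1) prior, and the square-root tightening schedule are all illustrative assumptions; the paper characterizes the exact rate at which the optimal threshold tightens, which this placeholder schedule does not reproduce.

```python
import random
from collections import defaultdict


class HomogeneousThresholdPolicy:
    """Accept/reject decisions under selective labels: an outcome (label)
    is observed only when the individual is accepted.

    Maintains a Beta-Bernoulli posterior over the success probability and
    accepts while the posterior mean exceeds a threshold that tightens as
    more labels are collected. The tightening schedule below is a
    hypothetical placeholder, not the rate derived in the paper.
    """

    def __init__(self, base_threshold=0.5, tighten_rate=0.01):
        self.successes = 0
        self.failures = 0
        self.base_threshold = base_threshold
        self.tighten_rate = tighten_rate

    @property
    def n_labels(self):
        return self.successes + self.failures

    def threshold(self):
        # Threshold grows with the number of observed labels
        # (placeholder square-root schedule, capped at 1).
        return min(1.0, self.base_threshold + self.tighten_rate * self.n_labels ** 0.5)

    def decide(self):
        # Posterior mean of the success probability under a Beta(1, 1) prior.
        posterior_mean = (self.successes + 1) / (self.n_labels + 2)
        return posterior_mean >= self.threshold()

    def update(self, outcome):
        # Called only when the individual was accepted (selective labels:
        # rejected individuals yield no label).
        if outcome:
            self.successes += 1
        else:
            self.failures += 1


class FiniteDomainPolicy:
    """Finite-domain case: one homogeneous policy per feature value,
    run in parallel."""

    def __init__(self, **kwargs):
        self.policies = defaultdict(lambda: HomogeneousThresholdPolicy(**kwargs))

    def decide(self, feature):
        return self.policies[feature].decide()

    def update(self, feature, outcome):
        self.policies[feature].update(outcome)


# Usage sketch on synthetic data (hypothetical per-group success rates):
policy = FiniteDomainPolicy()
true_rate = {"A": 0.7, "B": 0.3}
for _ in range(1000):
    x = random.choice(["A", "B"])
    if policy.decide(x):  # accept -> outcome is observed
        policy.update(x, random.random() < true_rate[x])
```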
