Risk-averse trees for learning from logged bandit feedback

Logged data is one of the most widespread forms of recorded information, since it can be acquired by almost any system and stored at little cost. Customarily, the interaction logs between a system and a user (or environment) have the structure of a sequential decision process: given a context, the system performs an action and the user provides feedback on it. This structure is common to a wide range of real-world micro-economic applications, e.g., e-commerce websites and advertising campaigns. The problem of learning a policy from such logged interactions in order to make more profitable decisions in the future is known as Learning from Logged Bandit Feedback (LLBF). In this paper, we propose RADT, an algorithm specifically designed for the LLBF setting and based on a risk-averse learning method that exploits the joint use of regression trees and statistical confidence bounds. Unlike existing techniques developed for this setting, RADT generates policies that aim to maximize a lower bound on the expected reward, and it provides a clear characterization of the features of the context that influence the process the most. Finally, we present an extensive experimental campaign on both synthetic and real-world datasets, showing empirical evidence that RADT outperforms both state-of-the-art machine learning classification and regression techniques and existing methods addressing the LLBF setting.
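
To make the "regression trees plus confidence bounds" idea concrete, here is a minimal sketch of a risk-averse tree policy for logged bandit data. It is not the RADT algorithm itself: the class name `LowerBoundTreePolicy`, the one-tree-per-action decomposition, the parameter `delta`, and the Hoeffding-style confidence term are all illustrative assumptions, and rewards are assumed to lie in [0, 1].

```python
# Illustrative sketch of a risk-averse tree policy for LLBF.
# NOT the paper's RADT algorithm: the per-action trees and the
# Hoeffding-style bound below are assumptions for illustration.
# Assumes rewards in [0, 1] and every action present in the log.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class LowerBoundTreePolicy:
    def __init__(self, n_actions, delta=0.05, max_depth=4):
        self.n_actions = n_actions
        self.delta = delta  # confidence level for the lower bound
        self.trees = [DecisionTreeRegressor(max_depth=max_depth)
                      for _ in range(n_actions)]
        self.leaf_counts = [None] * n_actions

    def fit(self, contexts, actions, rewards):
        """Fit one regression tree per action on its logged (context, reward) pairs."""
        for a in range(self.n_actions):
            mask = actions == a
            X, y = contexts[mask], rewards[mask]
            self.trees[a].fit(X, y)
            # Count how many logged samples fall in each leaf;
            # these counts drive the width of the confidence term.
            leaves = self.trees[a].apply(X)
            self.leaf_counts[a] = np.bincount(
                leaves, minlength=self.trees[a].tree_.node_count)

    def lower_bound(self, a, contexts):
        """Hoeffding-style bound: leaf mean - sqrt(log(1/delta) / (2 * n_leaf))."""
        mean = self.trees[a].predict(contexts)  # per-leaf mean reward
        n = np.maximum(self.leaf_counts[a][self.trees[a].apply(contexts)], 1)
        return mean - np.sqrt(np.log(1.0 / self.delta) / (2.0 * n))

    def act(self, contexts):
        """Risk-averse choice: the action maximizing the lower confidence bound."""
        bounds = np.stack([self.lower_bound(a, contexts)
                           for a in range(self.n_actions)])
        return bounds.argmax(axis=0)
```

Under these assumptions, usage would look like `policy = LowerBoundTreePolicy(n_actions=3)`, then `policy.fit(X_log, a_log, r_log)` on the logged interactions and `policy.act(X_new)` for new contexts. The design point the sketch captures is the risk-averse one: actions whose leaf estimates rest on few logged samples get a wide confidence penalty, so the policy prefers actions whose estimated reward is reliably, not just nominally, high.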
