Ad Recommendation Systems for Life-Time Value Optimization

The main objective in the ad recommendation problem is to find a strategy that, for each visitor to the website, selects the ad with the highest probability of being clicked. Such a strategy can be computed using supervised learning or contextual bandit algorithms, which treat two visits by the same user as two independent visitors and therefore optimize greedily, a single step into the future. An alternative is to use reinforcement learning (RL) methods, which distinguish between two visits by the same user and two different visitors, and therefore optimize over multiple steps into the future, i.e., the life-time value (LTV) of a customer. While greedy methods have been well studied, the LTV approach is still in its infancy, mainly due to two fundamental challenges: how to compute a good LTV strategy, and how to evaluate a solution using historical data to ensure its "safety" before deployment. In this paper, we tackle both challenges by proposing a family of off-policy evaluation techniques with statistical guarantees on the performance of a new strategy. We apply these methods to a real ad recommendation problem, both to evaluate final performance and to optimize the parameters of the RL algorithm. Our results show that our LTV optimization algorithm, equipped with these off-policy evaluation techniques, outperforms the greedy approaches. They also provide fundamental insights into the difference between the click-through rate (CTR) and LTV metrics for performance evaluation in the ad recommendation problem.
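
To make the idea of evaluating a new strategy from historical data concrete, the sketch below shows one standard form of off-policy evaluation: per-trajectory importance sampling of logged user visits, combined with a percentile-bootstrap lower confidence bound on the estimated LTV. This is a minimal illustration under simplifying assumptions; the function names, the synthetic logged data, and the candidate policy are hypothetical, and the paper's actual high-confidence estimators are more involved.

```python
import numpy as np

def importance_weighted_returns(trajectories, target_policy, gamma=0.95):
    """Per-trajectory importance-sampled returns.

    Each trajectory is a list of (state, action, behavior_prob, reward) tuples
    logged under the deployed (behavior) policy.  target_policy(state, action)
    returns the probability the candidate policy assigns to that action.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (state, action, behavior_prob, reward) in enumerate(traj):
            weight *= target_policy(state, action) / behavior_prob
            ret += (gamma ** t) * reward
        estimates.append(weight * ret)
    return np.array(estimates)

def bootstrap_lower_bound(estimates, confidence=0.95, n_boot=2000, seed=None):
    """Percentile-bootstrap lower confidence bound on the mean estimate."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(estimates, size=len(estimates), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, 1.0 - confidence)

# Toy usage with synthetic logged data (purely illustrative).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajectories = []
    for _ in range(500):                      # 500 logged visitors, 3 steps each
        traj = []
        for _ in range(3):
            state = rng.normal(size=2)        # user features
            action = int(rng.integers(2))     # ad shown uniformly at random
            reward = float(rng.random() < 0.1 + 0.05 * action)  # click indicator
            traj.append((state, action, 0.5, reward))
        trajectories.append(traj)

    # Hypothetical candidate policy: always show ad 1.
    target_policy = lambda state, action: 1.0 if action == 1 else 0.0

    est = importance_weighted_returns(trajectories, target_policy)
    print("Importance-sampling estimate of LTV:", est.mean())
    print("95% lower confidence bound         :", bootstrap_lower_bound(est))
```

A lower confidence bound of this kind is what allows a candidate strategy to be declared "safe": it is deployed only if the bound exceeds the estimated value of the currently running policy.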
