Approximate Linear Programming for Logistic Markov Decision Processes

Online and mobile interactions with users, in areas such as advertising and product or content recommendation, have been transformed by machine learning techniques. However, such methods have largely focused on myopic prediction, i.e., predicting immediate user response to system actions (e.g., ads or recommendations), without explicitly accounting for the long-term impact on user behavior or the potential need to plan sequences of actions. In this work, we propose the use of Markov decision processes (MDPs) to formulate this long-term decision problem and address two key questions that emerge in their application to user interaction. The first focuses on model formulation, specifically, how best to construct MDP models of user interaction in a way that exploits the great successes of myopic prediction models. To this end, we propose a new model called logistic MDPs, an MDP formulation that allows the concise specification of transition dynamics. It does so by augmenting the natural factored form of dynamic Bayesian networks (DBNs) with user response variables that are captured by a logistic regression model (the latter being precisely the model used for myopic user interaction). The second question we address is how best to solve large logistic MDPs of this type. A variety of methods have been proposed for solving MDPs that exploit the conditional independence reflected in DBN representations, including approximate linear programming (ALP). Despite their compact form, logistic MDPs do not admit the same conditional independence as DBNs, nor do they satisfy the linearity requirements of standard ALP. We propose a constraint generation approach to ALP for logistic MDPs that circumvents these problems by: (a) recovering compactness by conditioning on the logistic response variable; and (b) devising two procedures, one exact and one approximate, that linearize the search for violated constraints in the master LP. For the approximate procedure, we also derive error bounds on the quality of the induced policy. We demonstrate the effectiveness of our approach on advertising problems with up to several thousand sparse binarized features (up to 2^54 states and 2^39 actions).

∗ A shorter version of this paper appeared in the Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Aug. 2017.
† This work was performed while the author was a visiting intern at Google.
‡ This work was performed while the author was a visiting scholar at Google.
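To make the ALP component concrete, the following is the standard approximate LP of de Farias and Van Roy, together with the generic constraint-generation subproblem it induces. The notation (basis functions phi_i, weights w_i, state-relevance weights alpha, discount gamma) is standard ALP notation; this is a sketch of the general machinery, not this paper's exact development:

% Approximate the value function as V_w(x) = \sum_i w_i \phi_i(x)
% over a fixed set of basis functions \phi_i.
\begin{aligned}
  \min_{w}\ & \sum_{x} \alpha(x)\, V_w(x) \\
  \text{s.t.}\ & V_w(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V_w(x') \quad \forall\, (x,a)
\end{aligned}

% Constraint generation solves the master LP over a small subset of the
% constraints and repeatedly adds the most violated one:
(x^{*}, a^{*}) \in \operatorname*{arg\,max}_{(x,a)} \Big[\, R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V_w(x') - V_w(x) \,\Big]

In a logistic MDP this arg max is nonlinear in the state-action features, since P(x' | x, a) depends on the logistic response probability; the exact and approximate procedures described above linearize precisely this subproblem.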

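The logistic response component itself admits a minimal sketch (all names below, e.g. response_probability, are hypothetical placeholders rather than code from the paper): the user response is a binary variable whose probability is a logistic function of sparse binarized state-action features, and conditioning on its outcome recovers a compact DBN-style transition model.

import numpy as np

def response_probability(w: np.ndarray, phi: np.ndarray) -> float:
    """Myopic logistic response model: P(r=1 | x, a) = sigmoid(w . phi(x, a)),
    where w is a learned weight vector and phi(x, a) is the sparse binarized
    state-action feature vector (both hypothetical placeholders)."""
    return 1.0 / (1.0 + np.exp(-float(w @ phi)))

def transition_distribution(p_response: float,
                            kernel_given_response: np.ndarray,
                            kernel_given_no_response: np.ndarray) -> np.ndarray:
    """Logistic-MDP transition kernel as a response-conditioned mixture:
    P(x' | x, a) = p * P(x' | x, a, r=1) + (1 - p) * P(x' | x, a, r=0),
    where each conditional kernel factors as an ordinary DBN."""
    return (p_response * kernel_given_response
            + (1.0 - p_response) * kernel_given_no_response)

Conditioning on r is what item (a) above refers to: each conditional kernel retains the factored DBN form, so compact DBN-style reasoning applies to each branch of the mixture.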