Online Inference for Advertising Auctions

Advertisers that engage in real-time bidding (RTB) to display their ads commonly have two goals: learning their optimal bidding policy and estimating the expected effect of exposing users to their ads. Typical strategies pursue one of these goals while ignoring the other, creating an apparent tension between the two. This paper exploits the economic structure of the bid optimization problem faced by advertisers to show that the two objectives can in fact be perfectly aligned. By framing the advertiser's problem as a multi-armed bandit (MAB) problem, we propose a modified Thompson Sampling (TS) algorithm that concurrently learns the optimal bidding policy and estimates the expected effect of displaying the ad, while minimizing the economic losses from potentially sub-optimal bids. Simulations show that the proposed method not only accomplishes the advertiser's goals but also does so at a much lower cost than more conventional experimentation policies aimed at causal inference.
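To make the mechanics concrete, the sketch below shows one way a Thompson Sampling bidder for a second-price auction could couple bid optimization with estimation of the ad's effect. It is an illustrative toy under stated assumptions, not the paper's modified algorithm: the known value per conversion, the Beta-Bernoulli conversion model, and the uniform competing-bid distribution are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch (assumptions, not the paper's exact method):
# in a second-price auction with a known value per conversion, the
# expected-value bid is value * p, where p is the unknown probability
# that showing the ad produces a conversion. Thompson Sampling draws p
# from its Beta posterior and bids accordingly, so the same posterior
# drives both the bidding policy and the estimate of the ad's effect.

value = 1.0                    # assumed value of a conversion
alpha, beta = 1.0, 1.0         # Beta(1, 1) prior on conversion probability p
true_p = 0.05                  # unknown ground truth (used only to simulate outcomes)

for t in range(10_000):
    p_sample = rng.beta(alpha, beta)        # sample p from the current posterior
    bid = value * p_sample                  # expected-value bid under the sampled p
    competing_bid = rng.uniform(0.0, 0.1)   # assumed competing-bid distribution
    if bid > competing_bid:                 # win: the ad is shown, pay the second price
        converted = rng.random() < true_p   # observe the conversion outcome
        alpha += converted                  # conjugate Beta posterior update
        beta += 1 - converted

posterior_mean = alpha / (alpha + beta)     # estimate of the ad's expected effect
print(f"estimated conversion probability: {posterior_mean:.4f}")
```

The point the example illustrates is the alignment the abstract describes: the posterior over the ad's effect is exactly the object the bidder needs in order to bid optimally, so exploration for effect estimation and exploitation for bid optimization are not in conflict.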
