Scalable Audience Reach Estimation in Real-Time Online Advertising

Online advertising has emerged in recent years as one of the most efficient forms of advertising. Yet advertisers are concerned about the efficiency of their online campaigns and consequently want to restrict their ad impressions to certain websites and/or certain audience groups. These restrictions, known as targeting criteria, limit reachability in exchange for better performance. This trade-off between reachability and performance calls for a forecasting system that can estimate it quickly and with good accuracy. Designing such a system is challenging because of (a) the huge amount of data to process and (b) the need for fast and accurate estimates. In this paper, we propose a distributed, fault-tolerant system that generates such estimates quickly and with good accuracy. The main idea is to keep a small representative sample in memory across multiple machines and to formulate the forecasting problem as queries against the sample. The key challenge is to find the best strata across the past data and to perform multivariate stratified sampling while ensuring a fuzzy fall-back that covers small minorities. Our results show a significant improvement over the uniform and simple stratified sampling strategies that are currently widely used in the industry.
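
To make the idea of answering reach queries against a small in-memory sample concrete, the following is a minimal single-machine sketch of simple stratified sampling with weighted counting. It is not the system proposed in the paper: it does not select strata, does not handle multivariate stratification or the fuzzy fall-back, and is not distributed. The function names `stratified_sample` and `estimate_reach`, the `per_stratum` parameter, and the record fields are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Return (record, weight) pairs, sampling up to per_stratum records per stratum.

    Hypothetical helper: `key` maps a record to its stratum label. Each kept
    record carries weight = stratum size / sample size, so weighted counts
    over the sample estimate counts over the full data.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))  # small strata are kept in full
        weight = len(members) / k
        for record in rng.sample(members, k):
            sample.append((record, weight))
    return sample


def estimate_reach(sample, predicate):
    """Estimate how many records in the full data satisfy the targeting predicate."""
    return sum(weight for record, weight in sample if predicate(record))


if __name__ == "__main__":
    # Synthetic impressions: one large stratum and one small minority stratum.
    records = [{"country": "US", "device": "desktop"} for _ in range(1000)]
    records += [{"country": "IS", "device": "mobile"} for _ in range(10)]

    sample = stratified_sample(records, key=lambda r: r["country"], per_stratum=20)
    print(len(sample))
    print(estimate_reach(sample, lambda r: r["country"] == "IS"))
```

Because every stratum contributes at least some records, a query that targets a small group is still answered from the sample, whereas a uniform sample of the same size could miss that group entirely. Queries that cut across strata are estimated only approximately.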
