Probabilistic first pass retrieval for search advertising: from theory to practice

Information retrieval in search advertising, as in other ad-hoc retrieval tasks, aims to find the most appropriate ranking of the ad documents in a corpus for a given query. In addition to ranking the ad documents, we also need to filter out irrelevant ads, by thresholding, so that they do not participate in the auction for display alongside search results. In this work, we describe our experience in implementing a successful ad retrieval system for a commercial search engine based on the Language Modeling (LM) framework for retrieval. The LM system demonstrates significant performance improvements over the baseline vector space model (TF-IDF) system that was in production at the time. From a modeling perspective, we propose a novel approach to incorporating query segmentation and phrases in the LM framework, discuss the impact of score normalization on relevance filtering, and present preliminary results on incorporating query expansions using query rewriting techniques. From an implementation perspective, we also discuss the real-time latency constraints of a production search engine and how we overcome them by adapting the WAND algorithm to work with language models. In sum, our LM formulation is considerably better on accuracy metrics such as Precision-Recall (10% improvement in AUC) and nDCG (8% improvement in nDCG@5) on editorial data, and it also demonstrates significant improvements in clicks in live user tests (0.787% improvement in Click Yield, with an 8% coverage increase). Finally, we hope that this paper provides the reader with adequate insight into the challenges of building a system that serves millions of users every day.
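
To make the LM scoring idea concrete, the following is a minimal sketch of generic query-likelihood scoring with Dirichlet smoothing, followed by a per-term length normalization that makes scores comparable across queries of different lengths before thresholding. It is an illustration under stated assumptions, not the formulation used in this work: the function names, the smoothing parameter `mu`, the toy ads, and the choice of normalization are all hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, ad_terms, coll_counts, coll_len, mu=10.0):
    """Log P(query | ad) under a unigram LM with Dirichlet smoothing.

    mu is an assumed smoothing weight, not a value taken from the paper.
    """
    ad_counts = Counter(ad_terms)
    ad_len = len(ad_terms)
    score = 0.0
    for term in query_terms:
        # Background (collection) probability of the term.
        p_coll = coll_counts.get(term, 0) / coll_len
        # Dirichlet-smoothed probability of the term given the ad.
        p = (ad_counts[term] + mu * p_coll) / (ad_len + mu)
        if p == 0.0:
            # Term unseen in the whole collection: no probability mass.
            return float("-inf")
        score += math.log(p)
    return score


def normalized_score(query_terms, ad_terms, coll_counts, coll_len, mu=10.0):
    """Average per-term log-likelihood, one simple (assumed) normalization
    that lets a single threshold apply to queries of different lengths."""
    raw = query_log_likelihood(query_terms, ad_terms, coll_counts, coll_len, mu)
    return raw / len(query_terms)


# Toy corpus of two hypothetical ads.
ads = {
    "ad1": ["cheap", "flights", "paris"],
    "ad2": ["running", "shoes", "sale"],
}
coll_counts = Counter(t for terms in ads.values() for t in terms)
coll_len = sum(coll_counts.values())

query = ["cheap", "flights"]
scores = {
    name: normalized_score(query, terms, coll_counts, coll_len)
    for name, terms in ads.items()
}
# Rank ads by normalized score; a relevance threshold could then be
# applied to these scores to drop ads from the auction.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this sketch the ad that contains the query terms ranks first, and a query term absent from the entire collection yields a score of negative infinity, which any finite threshold filters out.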
