Mining Search Engine Clickthrough Log for Matching N-gram Features

User clicks on a URL in response to a query are extremely useful predictors of the URL's relevance to that query. However, exact-match click features suffer from severe data sparsity in web ranking. The sparsity is particularly pronounced for new URLs and long queries, where each distinct query-URL pair rarely occurs. To remedy this, we present a set of straightforward yet informative query-URL n-gram features that allow limited user click data to generalize to large numbers of unseen query-URL pairs. The method is motivated by techniques used in the NLP community for dealing with unseen words. We find interesting regularities between queries and their preferred destination URLs; for example, queries containing "form" tend to lead to clicks on URLs containing "pdf". We evaluate the new query-URL features on a web search ranking task and obtain improvements over a strong baseline with exact-match clickthrough features that are statistically significant at the p < 0.0001 level.
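To make the idea concrete, the following is a minimal sketch of how click counts might be aggregated over query n-grams and URL tokens so that an unseen query-URL pair still receives a score. It is an illustration under stated assumptions, not the feature definition used in the paper: the tokenization, the function names (`query_ngrams`, `url_tokens`, `build_counts`, `ngram_feature`), the add-`alpha` smoothing, the max aggregation, and the toy click log are all hypothetical.

```python
import re
from collections import Counter
from itertools import product


def query_ngrams(query, max_n=2):
    """Return all word n-grams of the query up to length max_n."""
    tokens = query.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens, ignoring the scheme."""
    stripped = re.sub(r"^[a-z]+://", "", url.lower())
    return [t for t in re.split(r"[^a-z0-9]+", stripped) if t]


def build_counts(click_log):
    """Aggregate clicks per (query n-gram, URL token) pair and per query n-gram.

    click_log is an iterable of (query, url, clicks) tuples (assumed format).
    """
    pair_counts = Counter()
    query_counts = Counter()
    for query, url, clicks in click_log:
        q_grams = set(query_ngrams(query))
        u_toks = set(url_tokens(url))
        for q in q_grams:
            query_counts[q] += clicks
        for q, u in product(q_grams, u_toks):
            pair_counts[(q, u)] += clicks
    return pair_counts, query_counts


def ngram_feature(query, url, pair_counts, query_counts, alpha=1.0):
    """Score a (possibly unseen) query-URL pair.

    Uses the largest smoothed fraction of an n-gram's clicks that went to
    URLs containing a given token (an assumed, illustrative aggregation).
    """
    u_toks = set(url_tokens(url))
    scores = [
        pair_counts[(q, u)] / (query_counts[q] + alpha)
        for q in set(query_ngrams(query))
        for u in u_toks
    ]
    return max(scores, default=0.0)


# Toy click log with made-up queries, URLs, and click counts.
toy_log = [
    ("tax form 1040", "http://forms.example.org/f1040.pdf", 5),
    ("visa application form", "http://docs.example.org/ds160.pdf", 3),
    ("weather seattle", "http://weather.example.com/seattle", 4),
]
pair_counts, query_counts = build_counts(toy_log)

# Neither query-URL pair below appears in the log.
pdf_score = ngram_feature("rental form", "http://files.test.net/lease.pdf",
                          pair_counts, query_counts)
html_score = ngram_feature("rental form", "http://files.test.net/lease.html",
                           pair_counts, query_counts)
```

In this toy setting the pair ("rental form", a URL ending in "pdf") scores higher than the same query paired with an "html" URL, because the n-gram "form" co-occurred with clicks on "pdf" URLs in the log. That mirrors the "form"/"pdf" regularity described above, even though the exact pair was never observed.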
