Precomputing search features for fast and accurate query classification

Query intent classification is crucial for web search and advertising. It is known to be challenging because web queries contain fewer than three words on average, and so provide little signal on which to base classification decisions. At the same time, the vocabulary used in search queries is vast: classifiers based on word occurrence therefore have to deal with a very sparse feature space, and often require large amounts of training data. Prior efforts to address feature sparseness augmented the feature space with features computed from the results obtained by issuing the query to be classified against a web search engine. However, these approaches induce high latency, making them unacceptable in practice. In this paper, we propose a new class of features that realizes the benefit of search-based features without the high latency. These features leverage co-occurrence between the query keywords and the tags applied to documents in search results, resulting in a significant boost to web query classification accuracy. By pre-computing the tag incidence for a suitably chosen set of keyword combinations, we are able to generate the features online with low latency and low memory requirements. We evaluate the accuracy of our approach using a large corpus of real web queries in the context of commercial search.
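The pre-computation idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy documents, the choice of all keyword subsets up to size two, and the `min_support` cutoff are all assumptions made for illustration. The abstract only states that tag incidence is pre-computed for a "suitably chosen" set of keyword combinations.

```python
from collections import Counter, defaultdict
from itertools import combinations


def keyword_combinations(text, max_size=2):
    """Yield sorted keyword tuples of size 1..max_size (an assumed selection rule)."""
    words = sorted(set(text.lower().split()))
    for size in range(1, max_size + 1):
        yield from combinations(words, size)


def precompute_tag_incidence(tagged_docs, max_size=2, min_support=1):
    """Offline step: count how often each tag co-occurs with each keyword combination."""
    table = defaultdict(Counter)
    for text, tags in tagged_docs:
        for combo in keyword_combinations(text, max_size):
            table[combo].update(tags)
    # Keep only combinations seen often enough (hypothetical pruning rule).
    return {c: t for c, t in table.items() if sum(t.values()) >= min_support}


def tag_features(query, table, max_size=2):
    """Online step: table lookups only, no call to a search engine."""
    totals = Counter()
    for combo in keyword_combinations(query, max_size):
        totals.update(table.get(combo, {}))
    n = sum(totals.values())
    # Normalize counts into a tag distribution usable as classifier features.
    return {tag: count / n for tag, count in totals.items()} if n else {}


# Toy corpus of (document text, tags); purely illustrative data.
docs = [
    ("cheap digital camera reviews", ["shopping", "electronics"]),
    ("camera repair manual", ["electronics"]),
    ("python snake care", ["pets"]),
]
table = precompute_tag_incidence(docs)
features = tag_features("digital camera", table)
```

In this sketch the online cost is a handful of dictionary lookups per query, which is the property the abstract emphasizes. A production system would presumably need a more selective choice of combinations and a compact table representation to meet the stated memory requirements.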
