Improving automatic query classification via semi-supervised learning

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system is to return results not just from a general Web collection but from topic-specific back-end databases as well. Maintaining sufficient classification recall is very difficult because Web queries are typically short, yielding few features per query. This feature sparseness, coupled with the high query volumes typical of a large-scale search service, makes manual and supervised learning approaches alone insufficient. We use an application of computational linguistics to develop an approach for mining the vast amount of unlabeled data in Web query logs to improve automatic topical Web query classification. We show that our approach, in combination with manual matching and supervised learning, allows us to classify a substantially larger proportion of queries than any single technique. We examine the performance of each approach on a real Web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of the best single approach by nearly 20%, with a 7% improvement in overall effectiveness.
