Discovering key concepts in verbose queries

Current search engines do not, in general, perform well with longer, more verbose queries. One of the main issues in processing these queries is identifying the key concepts that will have the most impact on effectiveness. In this paper, we develop and evaluate a technique that uses query-dependent, corpus-dependent, and corpus-independent features for automatic extraction of key concepts from verbose queries. We show that our method achieves higher accuracy in the identification of key concepts than standard weighting methods such as inverse document frequency. Finally, we propose a probabilistic model for integrating the weighted key concepts identified by our method into a query, and demonstrate that this integration significantly improves retrieval effectiveness for a large set of natural language description queries derived from TREC topics on several newswire and web collections.
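For readers unfamiliar with the baseline named above, the following is a minimal sketch of ranking query concepts by inverse document frequency. It is not the method developed in this paper. The function names, the add-one smoothing, and the toy corpus are all illustrative assumptions.

```python
import math


def idf_scores(concepts, documents):
    """Score each concept by smoothed inverse document frequency.

    `documents` is a list of token sets. The add-one smoothing is an
    assumption made here so that unseen concepts do not divide by zero.
    """
    n_docs = len(documents)
    scores = {}
    for concept in concepts:
        # Document frequency: number of documents containing the concept.
        df = sum(1 for doc in documents if concept in doc)
        scores[concept] = math.log((n_docs + 1) / (df + 1))
    return scores


def rank_concepts(concepts, documents):
    """Return concepts ordered from highest to lowest idf score."""
    scores = idf_scores(concepts, documents)
    return sorted(concepts, key=lambda c: scores[c], reverse=True)


# Toy corpus (hypothetical): each document is a set of tokens.
toy_documents = [
    {"the", "spanish", "civil", "war"},
    {"the", "war"},
    {"the", "election"},
]
ranked = rank_concepts(["the", "war", "spanish"], toy_documents)
```

On this toy corpus the rarest term ranks first. A weighting of this kind uses only corpus statistics, whereas the approach described above also draws on query-dependent and corpus-independent features.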
