What was the Query? Generating Queries for Document Sets with Applications in Cluster Labeling

We address the task of generating a query that retrieves a given set of documents. In the abstract, this amounts to “compressing” the document set into a short query. The task also has a real-world application: cluster labeling (e.g., for faceted search). We label a cluster with a query that approximately retrieves the cluster’s documents. To remain generalizable, our approach does not require access to a search index; a public interface such as an API suffices. The approach can therefore also be implemented on the client side.
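To make the task concrete, the following is a minimal sketch of how a query could be chosen for a target document set when the search engine is treated as a black box. It is not the authors' algorithm: the exhaustive enumeration of term combinations, the use of F1 as the measure of "approximately retrieves", the `max_terms` limit, and the in-memory `toy_search` function standing in for a public search API are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence, Set, Tuple

# Hypothetical toy corpus; a real setting would query a public search API instead.
DOCS = {
    "d1": "jaguar car speed engine",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle prey",
    "d4": "tiger cat jungle prey",
}


def toy_search(query: str) -> Set[str]:
    """Stand-in for a search API: return ids of documents containing all query terms."""
    terms = query.split()
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if all(term in text.split() for term in terms)
    }


def f1(retrieved: Set[str], target: Set[str]) -> float:
    """Harmonic mean of precision and recall of the retrieved set against the target set."""
    hits = len(retrieved & target)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(target)
    return 2 * precision * recall / (precision + recall)


def best_query(
    candidate_terms: Sequence[str],
    target: Set[str],
    search: Callable[[str], Set[str]],
    max_terms: int = 3,
) -> Tuple[Tuple[str, ...], float]:
    """Try every combination of up to max_terms terms; keep the first one with the highest F1."""
    best_terms: Tuple[str, ...] = ()
    best_score = 0.0
    for k in range(1, max_terms + 1):
        for terms in combinations(candidate_terms, k):
            score = f1(search(" ".join(terms)), target)
            if score > best_score:  # strict '>' prefers shorter, earlier queries on ties
                best_terms, best_score = terms, score
    return best_terms, best_score


if __name__ == "__main__":
    cluster = {"d1", "d2"}  # the document set to be "compressed" into a query
    vocabulary = ["jaguar", "car", "cat", "jungle"]
    print(best_query(vocabulary, cluster, toy_search))
```

Because the sketch only calls `search`, it needs no access to the index itself, which mirrors the client-side setting described above. Exhaustive enumeration is practical only for small vocabularies; each candidate costs one call to the search interface.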
