Optimizing index for taxonomy keyword search

Query substitution is an important problem in information retrieval. Much work focuses on how to find substitutes for a given query. In this paper, we study how to efficiently process a keyword query whose substitutes are defined by a given taxonomy. This problem is challenging because each term in a query can have a large number of substitutes, and the original query can be rewritten into any combination of them. We propose building an additional index, alongside the inverted index, to process such queries efficiently. For a given query workload, we formulate an optimization problem that chooses the structure of the additional index so as to minimize the query evaluation cost under a given index space constraint. We show that the problem is NP-hard, and propose a pseudo-polynomial time algorithm based on dynamic programming, as well as a (1/4)(1 - 1/e)-approximation algorithm. Experimental results show that, with only 10% additional index space, our approach can greatly reduce the query evaluation cost.
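To make the combinatorial growth concrete, the following minimal sketch enumerates every rewrite of a two-term query. It assumes that a term's substitutes are the term itself plus all of its descendants in the taxonomy. The toy taxonomy and the helper names `substitutes` and `rewrites` are hypothetical and serve only as an illustration; this is not the index structure proposed in the paper.

```python
from itertools import product

# Hypothetical toy taxonomy: each term maps to its direct, more specific terms.
TAXONOMY = {
    "pet": ["dog", "cat"],
    "dog": ["poodle", "beagle"],
    "food": ["kibble", "treats"],
}

def substitutes(term, taxonomy):
    """Return the term itself followed by all of its descendants (assumed substitutes)."""
    found = [term]
    seen = {term}
    stack = [term]
    while stack:
        for child in taxonomy.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found

def rewrites(query_terms, taxonomy):
    """Enumerate every query obtained by substituting each term independently."""
    options = [substitutes(t, taxonomy) for t in query_terms]
    return [" ".join(combo) for combo in product(*options)]

all_rewrites = rewrites(["pet", "food"], TAXONOMY)
print(len(all_rewrites))  # 5 choices for "pet" times 3 choices for "food"
```

The number of rewrites is the product of the substitute counts of the individual terms, so it grows quickly with both query length and taxonomy size. Evaluating each rewrite separately against an inverted index is the cost that an additional index aims to reduce.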
