TopX - Efficient and Versatile Top-k Query Processing for Text, Semistructured, and Structured Data

This paper presents a comprehensive overview of the TopX search engine, an extensive framework for unified indexing and querying of large collections of unstructured, semistructured, and structured data. Situated at the intersection of database (DB) engineering and information retrieval (IR), TopX integrates efficient scheduling algorithms for top-k-style ranked retrieval with powerful scoring models, as well as dynamic and self-throttling query expansion facilities.
