An Efficient and Versatile Query Engine for TopX Search

This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but nonschematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The difficulties in applying the existing top-k algorithms to XML data lie in 1) the need to consider scores for XML elements while aggregating them at the document level, 2) the combination of vague content conditions with XML path conditions, 3) the need to relax query conditions if too few results satisfy all conditions, and 4) the selectivity estimation for both content and structure conditions and their impact on evaluation strategies. TopX addresses these issues by precomputing score and path information in an appropriately designed index structure, by largely avoiding or postponing the evaluation of expensive path conditions so as to preserve the sequential access pattern on index lists, and by selectively scheduling random accesses when they are cost-beneficial. In addition, TopX can compute approximate top-k results using probabilistic score estimators, thus speeding up queries with a small and controllable loss in retrieval precision.
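To make the threshold-algorithm paradigm mentioned above concrete, the following is a minimal sketch of a generic top-k algorithm that uses sequential (sorted) accesses only. It is not the TopX implementation: it ignores XML elements, path conditions, random-access scheduling, and probabilistic pruning. The function name `nra_top_k`, the in-memory list layout, and the summation-based score aggregation are assumptions made purely for illustration.

```python
import heapq


def nra_top_k(index_lists, k):
    """Generic threshold-style top-k using sorted accesses only (illustrative sketch).

    index_lists: one list per query condition, each holding (doc_id, score)
    pairs sorted by descending score. A document's total score is assumed to
    be the sum of its per-list scores (0 for lists that do not contain it).
    Returns up to k (doc_id, worstscore) pairs, best first.
    """
    m = len(index_lists)
    seen = {}  # doc_id -> {list index: score seen in that list}
    # high[i] is an upper bound on any score not yet read from list i.
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]
    max_depth = max((len(lst) for lst in index_lists), default=0)

    def worstscore(doc):
        # Lower bound: sum of the scores actually seen so far.
        return sum(seen[doc].values())

    def bestscore(doc):
        # Upper bound: assume the best remaining score in every unseen list.
        return worstscore(doc) + sum(
            high[i] for i in range(m) if i not in seen[doc]
        )

    def current_top():
        return heapq.nlargest(
            k, ((d, worstscore(d)) for d in seen), key=lambda pair: pair[1]
        )

    for depth in range(max_depth):
        # One round of sequential accesses: read the next entry of every list.
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted, nothing more can be added

        top = current_top()
        if len(top) < k:
            continue
        min_k = top[-1][1]
        top_ids = {doc for doc, _ in top}
        # Stop once neither an unseen document nor any remaining candidate
        # can still exceed the k-th best lower bound.
        if sum(high) <= min_k and all(
            bestscore(doc) <= min_k for doc in seen if doc not in top_ids
        ):
            return top

    return current_top()


# Hypothetical index lists for two query conditions.
list_a = [("d1", 0.875), ("d2", 0.75), ("d3", 0.125)]
list_b = [("d2", 0.75), ("d1", 0.25), ("d3", 0.125)]
print(nra_top_k([list_a, list_b], k=1))
```

In this toy input the scan stops after reading two entries per list, without touching `d3`. This early termination is what keeps threshold algorithms inexpensive on long index lists. Note that the returned scores are lower bounds: they are exact only for documents that were seen in every list before the scan stopped.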
