We present a new framework for indexing, locating, and ranking XML documents based on content and structural synopses extracted from the documents. Instead of indexing every element or term in a document, we extract a structural summary and a small number of data synopses, which are indexed efficiently in a form suitable for query evaluation. Our query language is XPath extended with full-text search. The result of query evaluation is a ranked list of the document locations that best match the query. We propose a novel aggregated ranking scheme, integrated into query evaluation, that scores documents based on these data synopses. Our experimental evaluation shows that our indexing scheme outperforms the standard XML indexing scheme based on inverted lists and that our ranking scheme is effective in terms of precision and recall.