Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data

Even before the advent of XML, and all the more since, many applications have called for efficient structured document retrieval, challenging both Information Retrieval (IR) and database (DB) research. Most approaches combining indexing techniques from both fields still separate path and content matching, merging the hits in an expensive join. This paper shows that retrieval is significantly accelerated by processing text and structure simultaneously. The Content-Aware DataGuide (CADG) interleaves IR and DB indexing techniques to minimize path matching and suppress joins at query time, also saving needless I/O operations during retrieval. Extensive experiments show that the CADG outperforms the DataGuide [11,14] by a factor of 5 to 200 on average. For structurally unselective queries, it is over 400 times faster than the DataGuide. The best results were achieved on large collections of heterogeneously structured textual documents.
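The idea of processing text and structure simultaneously can be illustrated with a toy path summary. The following is only a minimal sketch under stated assumptions, not the CADG as implemented in the paper: `GuideNode`, `insert`, `search`, the `subtree_terms` keyword sets, and the three sample documents are all hypothetical names and data introduced for illustration. Each node of a DataGuide-like label tree records which keywords occur at or below its path, so that a query for a keyword inside a given element can skip whole branches without a separate join against an inverted list.

```python
from dataclasses import dataclass, field


@dataclass
class GuideNode:
    """One label path in a toy, DataGuide-like path summary (hypothetical)."""
    label: str
    children: dict = field(default_factory=dict)     # label -> GuideNode
    postings: dict = field(default_factory=dict)     # keyword -> doc ids at this exact path
    subtree_terms: set = field(default_factory=set)  # keywords at or below this path


def insert(guide, doc_id, element):
    """Merge one element (label, text, children) into the guide.

    Returns the set of keywords found at or below the element.
    """
    label, text, kids = element
    node = guide.children.setdefault(label, GuideNode(label))
    terms = set(text.lower().split())
    for term in terms:
        node.postings.setdefault(term, set()).add(doc_id)
    for kid in kids:
        terms |= insert(node, doc_id, kid)
    node.subtree_terms |= terms
    return terms


def search(node, label, keyword, use_content, stats):
    """Return ids of documents with a `label` element (any depth) containing `keyword`."""
    hits = set()
    for child in node.children.values():
        if use_content and keyword not in child.subtree_terms:
            continue  # content information rules out this whole branch
        stats["visited"] += 1
        if child.label == label:
            hits |= child.postings.get(keyword, set())
        hits |= search(child, label, keyword, use_content, stats)
    return hits


# Three small, differently structured documents (invented sample data).
docs = {
    1: ("book", "", [("title", "XML indexing", []),
                     ("chapter", "", [("para", "signature files", [])])]),
    2: ("article", "", [("title", "query joins", []),
                        ("section", "", [("para", "XML retrieval", [])])]),
    3: ("report", "", [("abstract", "inverted files", []),
                       ("appendix", "", [("para", "signature tables", [])])]),
}

root = GuideNode("#root")
for doc_id, tree in docs.items():
    insert(root, doc_id, tree)

plain = {"visited": 0}   # structure-only traversal
aware = {"visited": 0}   # traversal pruned by keyword information
plain_hits = search(root, "para", "xml", False, plain)
aware_hits = search(root, "para", "xml", True, aware)
print(plain_hits, plain["visited"], aware_hits, aware["visited"])
```

Both traversals return the same documents, but the pruned one touches fewer guide nodes. In this sketch the keyword sets are stored exactly; a real index would need a more compact representation, which is outside what this example shows.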

[1]  Holger Meuss,et al.  Improving Index Structures for Structured Document Retrieval , 1999, BCS-IRSG Annual Colloquium on IR Research.

[2]  Dongwook Shin,et al.  BUS: an effective indexing and retrieval scheme in structured documents , 1998, DL '98.

[3]  Roy Goldman,et al.  Lore: a database management system for semistructured data , 1997, SGMD.

[4]  Tat-Seng Chua,et al.  Hierarchical Indexing and Flexible Element Retrieval for Structured Document , 2003, ECIR.

[5]  Alfred V. Aho,et al.  Data Structures and Algorithms , 1983 .

[6]  Michael J. Franklin,et al.  A Fast Index for Semistructured Data , 2001, VLDB.

[7]  Jeffrey F. Naughton,et al.  On the integration of structure indexes and inverted lists , 2004, Proceedings. 20th International Conference on Data Engineering.

[8]  Karl Aberer,et al.  Combining Pat-Trees and Signature Files for Query Evaluation in Document Databases , 1999, DEXA.

[9]  Quanzhong Li,et al.  Indexing and Querying XML Data for Regular Path Expressions , 2001, VLDB.

[10]  Kotagiri Ramamohanarao,et al.  Inverted files versus signature files for text indexing , 1998, TODS.

[11]  Norbert Fuhr,et al.  XIRQL: a query language for information retrieval in XML documents , 2001, SIGIR '01.

[12]  Felix Weigel.  Content-Aware DataGuides for Indexing Semi-Structured Data , 2003 .

[13]  Armin B. Cremers,et al.  Searching and browsing collections of structural information , 2000, Proceedings IEEE Advances in Digital Libraries 2000.

[14]  Felix Weigel.  A Survey of Indexing Techniques for Semistructured Documents, Institute of Computer Science, LMU, Mu , 2002 .

[15]  Ricardo A. Baeza-Yates,et al.  Integrating contents and structure in text retrieval , 1996, SGMD.

[16]  Roy Goldman,et al.  DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases , 1997, VLDB.

[17]  Christos Faloutsos,et al.  Signature files: design and performance comparison of some signature extraction methods , 1985, SIGMOD Conference.

[18]  John Beidler,et al.  Data Structures and Algorithms , 1996, Wiley Encyclopedia of Computer Science and Engineering.

[19]  Torsten Schlieder,et al.  Querying and ranking XML documents , 2002, J. Assoc. Inf. Sci. Technol.

[20]  Raymond K. Wong,et al.  A fast and versatile path index for querying semi-structured data , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings.

[21]  Juergen Oesterle,et al.  The GNoP (German noun phrase) treebank , 1998 .

[22]  François Bry,et al.  Visual Querying and Exploration of Large Answers in XML Databases with X2. , 2003, IEEE International Conference on Data Engineering.