Index compression vs. retrieval time of inverted files for XML documents

Query languages for XML retrieval allow conditions that refer to both the content and the structure of documents. In this paper, we investigate two approaches for reducing the index space of inverted files for XML documents. First, we consider methods for compressing index entries. Second, we develop the new XS tree data structure, which holds the structural description of a document in a form compact enough to be kept in main memory. Experimental results on two large XML document collections show that very high compression rates for indexes can be achieved, but any compression increases retrieval time. Highly compressed indexes may nevertheless be feasible for applications where storage is limited, such as PDAs or e-book devices.
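
The abstract does not say which codes are used for the index entries, so the following is only a minimal sketch of one common way to compress an inverted-file posting list: store the gaps between ascending document numbers and encode each gap with the Elias gamma code. The function names and the bit-string representation are assumptions made for illustration, not the method evaluated in the paper.

```python
def gamma_encode(n):
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                      # e.g. 4 -> "100"
    return "0" * (len(binary) - 1) + binary  # unary length prefix + binary


def compress_postings(doc_ids):
    """Gap-encode a strictly ascending list of positive document numbers."""
    bits = []
    prev = 0
    for doc_id in doc_ids:
        bits.append(gamma_encode(doc_id - prev))  # small gaps -> short codes
        prev = doc_id
    return "".join(bits)


def decompress_postings(bits):
    """Decode a gamma-coded gap sequence back into document numbers."""
    doc_ids = []
    pos = 0
    prev = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":              # count the unary length prefix
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap                          # undo the gap encoding
        doc_ids.append(prev)
    return doc_ids


postings = [3, 7, 8, 20]
encoded = compress_postings(postings)
decoded = decompress_postings(encoded)
```

The decoder has to walk the bit stream sequentially before any document number is available, which illustrates why a more compact index tends to cost extra time at retrieval.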
