XMill: An Efficient Compressor for XML Data

We describe a tool for compressing XML data, called XMill, that usually achieves about twice the compression ratio of gzip at roughly the same speed. The intended applications are XML data exchange and archiving. XMill does not need schema information (such as a DTD or an XML Schema), but it can exploit hints about such a schema to further improve the compression ratio. XMill incorporates and combines existing compressors in order to compress heterogeneous XML data: it uses zlib, the library function for gzip, as well as a collection of datatype-specific compressors. XMill can be extended with new specialized compressors, which is useful in applications that manage XML data with highly specialized data types, such as DNA sequences or images. The paper presents a theoretical justification for the method used, the XMill architecture and implementation, a new language for expressing hints about the XML schema, and a series of experiments validating XMill on several real data sets.
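The idea of combining zlib with per-datatype handling can be illustrated with a small sketch. It is not XMill's implementation: it assumes, for illustration only, that text values are grouped into one container per enclosing element tag, that the tag structure is kept as a separate token stream, and that every container is compressed independently with zlib. The function names, the `__structure__` key, and the token format are hypothetical, and attributes and mixed content are ignored.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_into_containers(xml_text):
    """Separate structure from data (illustrative assumption, not XMill's format).

    Returns a list of structure tokens and a dict mapping each element tag
    to the list of text values found directly inside elements with that tag.
    """
    root = ET.fromstring(xml_text)
    structure = []
    containers = defaultdict(list)

    def walk(elem):
        structure.append("<" + elem.tag)      # element start
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)  # value goes to its tag's container
            structure.append("#")              # placeholder for the moved value
        for child in elem:
            walk(child)
        structure.append(">")                  # element end

    walk(root)
    return structure, dict(containers)


def compress_containers(structure, containers):
    """Compress the structure stream and each container separately with zlib."""
    blobs = {"__structure__": zlib.compress(" ".join(structure).encode("utf-8"))}
    for tag, values in containers.items():
        # A datatype-specific compressor could be substituted here per tag.
        blobs[tag] = zlib.compress("\n".join(values).encode("utf-8"))
    return blobs


doc = (
    "<people>"
    "<person><name>Ann</name><age>31</age></person>"
    "<person><name>Bob</name><age>47</age></person>"
    "</people>"
)
structure, containers = split_into_containers(doc)
blobs = compress_containers(structure, containers)
```

In this sketch all `age` values end up in one container and all `name` values in another, so each zlib stream sees similar data; the per-tag loop is also the point where a specialized compressor for a particular data type could be plugged in.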
