Comparative Analysis of XML Compression Technologies

XML provides flexibility in publishing and exchanging heterogeneous data on the Web. However, the language is verbose by nature, so XML documents are usually larger than other formats containing the same data content. It is natural to expect that data size will continue to grow as XML data proliferates on the Web. The size of XML documents hinders the application of XML, since it substantially increases the costs of storing, processing and exchanging the data. The hindrance is more apparent in bandwidth- and memory-limited settings, such as applications related to mobile communication.

In this paper, we survey a range of recently proposed XML-specific compression technologies and study how far they overcome the size problem. First, by categorizing XML compression technologies into queriable and unqueriable compressors, we explain how the representative technologies utilize the structure information exposed by the input XML documents. Second, we discuss the importance of queriable XML compressors and assess whether the compressed XML documents generated by these technologies are able to support direct querying on XML data. Finally, we present a comparative analysis of the state-of-the-art XML-conscious compression technologies in terms of compression ratio, compression and decompression times, memory consumption, and query performance.
