XXS

The eXtensible Markup Language (XML) is acknowledged as the de facto standard for semistructured data representation and data exchange on the Web and in many other scenarios. A well-known shortcoming of XML is its verbosity, which increases manipulation, transmission, and processing costs. Various structure-blind and structure-conscious compression techniques can be applied to XML, and some are even access-friendly, meaning that documents can be accessed efficiently in compressed form. Direct access is necessary to implement XPath and XQuery, the standard query languages for exploiting the expressiveness of XML. While many theoretical and practical proposals exist for solving XPath/XQuery operations on XML, only a few are well integrated with a compression format that supports the required access operations on the XML data. In this work we go one step further and design a compression format for XML collections that boosts the performance of XPath queries on the data. We do so by designing compressed representations of the XML data that support complex operations beyond plain data access, and we exploit those operations to solve key components of XPath queries. Our system, called XXS, is aimed at XML collections containing natural language text, which it compresses to within 35%–50% of their original size while supporting a large subset of XPath operations in time competitive with, and often outperforming, the best state-of-the-art systems that work on uncompressed representations.
