A GML compression approach based on on-line semantic clustering

Geography Markup Language (GML) has become a de facto international encoding standard for exchanging geospatial data among heterogeneous Geographic Information Systems (GIS). However, structurally redundant tags and textual data representation usually inflate the size of GML documents substantially, which makes storage and transport costly. In this paper, we propose an effective compression approach based on on-line semantic clustering of GML documents. The approach processes a GML document on the fly: it separates data from structure, clusters the data according to semantic similarities derived from tags and text, dictionary-encodes the structure, and delta-encodes the geometric coordinate data, before applying general text compression on the back end. We conduct extensive experiments on real GML documents to evaluate the performance of the proposed approach. Results show that, in compression ratio, our approach outperforms gzip, the most popular general text compressor; XMill, the acknowledged best XML compressor; and GPress, the first and so far the only GML compressor.
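To make two of the steps named in the abstract concrete, the following minimal sketch illustrates dictionary-encoding of tag names and delta-encoding of coordinates, followed by a general-purpose back-end compressor. It is an illustration under stated assumptions, not the paper's implementation: the function names, the sample tag sequence, the choice of zlib as back end, and the representation of coordinates as fixed-point integers are all hypothetical, and the semantic clustering step is not shown.

```python
import zlib


def dictionary_encode(tags):
    """Map each distinct tag name to a small integer id, in first-seen order."""
    table = {}
    codes = []
    for tag in tags:
        if tag not in table:
            table[tag] = len(table)
        codes.append(table[tag])
    return table, codes


def delta_encode(values):
    """Keep the first value; store every later value as a difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def backend_compress(numbers):
    """Serialize integers as text and hand them to a general compressor (zlib here, as an assumption)."""
    return zlib.compress(" ".join(str(n) for n in numbers).encode("ascii"))


# Hypothetical sample: a repeating tag sequence and the x coordinates of one line string,
# stored as fixed-point integers so that the deltas are exact.
sample_tags = ["gml:featureMember", "gml:LineString", "gml:coordinates",
               "gml:featureMember", "gml:LineString", "gml:coordinates"]
sample_xs = [1205431, 1205439, 1205452, 1205460]

tag_table, tag_codes = dictionary_encode(sample_tags)   # tag_codes == [0, 1, 2, 0, 1, 2]
x_deltas = delta_encode(sample_xs)                      # x_deltas == [1205431, 8, 13, 8]
packed = backend_compress(x_deltas)
```

Neighboring vertices of a geometry tend to lie close together, so the deltas are small numbers with few digits, which is the kind of input a general text compressor handles well. Decoding reverses the steps: decompress, parse the integers, and apply `delta_decode`.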
