XML Compression via Directed Acyclic Graphs

Unranked node-labeled trees can be represented by their minimal dag (directed acyclic graph). For XML this achieves high compression ratios because XML markup is highly repetitive. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (measured as the number of edges) between the minimal dag of an unranked tree and the minimal dag of its fcns-encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be no larger than either of the two dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under the uniform distribution) in terms of their exact generating functions and in terms of their asymptotic behavior.
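To make the two size measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes trees are given as nested tuples `(label, children)`, that a dag's size counts edges with multiplicity, and that identical subtrees are shared by structural equality. The names `dag_size`, `fcns`, and `binary_dag_size` are hypothetical and chosen only for this illustration.

```python
def dag_size(tree):
    """Edges in the minimal dag of an unranked tree (label, children_tuple).

    Assumption: each distinct subtree becomes one dag node, and every
    child pointer of a distinct subtree counts as one edge.
    """
    seen, edges, stack = set(), 0, [tree]
    while stack:
        node = stack.pop()
        if node in seen:          # subtree already shared, add no new edges
            continue
        seen.add(node)
        _, children = node
        edges += len(children)
        stack.extend(children)
    return edges


def fcns(forest):
    """First child/next sibling encoding of a tuple of sibling trees.

    Returns a binary tree (label, left, right), where left encodes the
    children of the first tree and right encodes its following siblings.
    None stands for a missing child.
    """
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, fcns(children), fcns(rest))


def binary_dag_size(btree):
    """Edges in the minimal dag of a binary tree (label, left, right)."""
    seen, edges, stack = set(), 0, [btree]
    while stack:
        node = stack.pop()
        if node is None or node in seen:
            continue
        seen.add(node)
        _, left, right = node
        for child in (left, right):
            if child is not None:  # only real child pointers count as edges
                edges += 1
                stack.append(child)
    return edges


# Example tree r(c(a, b), c(a, b)): the subtree c(a, b) occurs twice.
a, b = ("a", ()), ("b", ())
c = ("c", (a, b))
t = ("r", (c, c))

unranked_edges = dag_size(t)                # 4 edges: r->c twice, c->a, c->b
binary_edges = binary_dag_size(fcns((t,)))  # 5 edges: the two c nodes differ after encoding
```

In this small example the fcns encoding destroys the sharing of the repeated subtree `c(a, b)`, because the first occurrence gains a next-sibling pointer that the second lacks. This is the kind of size difference between the two dags that the abstract refers to.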
