Beyond Lazy XML Parsing

XML has become the standard format for data representation and exchange in domains ranging from the Web to desktop applications. However, its wider adoption is hindered by inefficient document-parsing methods. Recent work on lazy parsing is a major step towards alleviating this problem. Even so, lazy parsers must still read the entire XML document to extract its overall structure, because XML documents contain no internal navigation pointers. Further, these parsers must parse the entire virtual document tree and load it into memory during XML query processing. These overheads significantly degrade the performance of navigation operations. We have developed a framework for efficient XML parsing based on placing internal physical pointers within the document, which allow the parser to skip large portions of the document during parsing. The pointers are generated in a way that optimizes parsing for common navigation patterns. A double-Lazy Parser (2LP) then exploits these pointers to parse the document. To create the internal pointers, we use constructs supported by the current W3C XML standard. We study our pointer generation and parsing algorithms both theoretically and experimentally, and show that they perform considerably better than existing approaches.
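
To make the idea of skipping via internal pointers concrete, the following minimal Python sketch may help. It is an illustration only and not the encoding or the 2LP algorithm of this work: it assumes a hypothetical processing instruction `<?skip N?>` placed directly before each child element, where `N` is the character length of that element's serialized text. The names `build_document`, `nth_child`, and the `skip` target are all invented for this example; processing instructions are used only because they are one construct the W3C XML standard supports.

```python
import re

# Hypothetical pointer format (an assumption, not taken from the paper):
# <?skip N?> sits directly before an element, and N is the character
# length of that element's serialized text.
POINTER = re.compile(r"<\?skip (\d+)\?>")


def build_document(children):
    """Serialize a <root> whose children are each preceded by a skip pointer."""
    parts = ["<root>"]
    for name, inner in children:
        element = f"<{name}>{inner}</{name}>"
        parts.append(f"<?skip {len(element)}?>{element}")
    parts.append("</root>")
    return "".join(parts)


def nth_child(doc, n):
    """Return the serialized n-th child (0-based) of the root.

    Earlier siblings are never scanned: the reader hops from pointer to
    pointer, jumping over each sibling's subtree in one step.
    """
    pos = len("<root>")
    while True:
        match = POINTER.match(doc, pos)
        if match is None:
            # No pointer here, so we reached </root> before finding child n.
            raise IndexError("child index out of range")
        start = match.end()
        length = int(match.group(1))
        if n == 0:
            return doc[start:start + length]
        pos = start + length  # skip the whole sibling subtree
        n -= 1


doc = build_document([("a", "x" * 1000), ("b", "<c>hi</c>"), ("d", "")])
print(nth_child(doc, 1))
```

In this sketch the cost of reaching a child grows with the number of siblings hopped over rather than with the size of their subtrees, which is the kind of saving the pointers are meant to provide. A document annotated this way remains well-formed XML, since a conventional parser simply ignores or passes through the processing instructions.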
