2LP: A double-lazy XML parser

XML is acknowledged as the most effective format for data encoding and exchange in domains ranging from the World Wide Web to desktop applications. However, its large-scale adoption in actual system implementations is being slowed by the inefficiency of its document-parsing methods. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have a key drawback: they must load the entire XML document to extract the overall document structure before parsing can be performed. We have developed a framework for efficient parsing based on the idea of placing internal physical pointers within the XML document, which allow the navigation process to skip large portions of the document during parsing. We show how to generate such internal pointers in a way that optimizes parsing, using only constructs supported by the current W3C XML standard. A double-lazy parser (2LP) exploits these internal pointers to parse the document efficiently. Because the pointers are built from supported W3C constructs, 2LP is backward compatible, i.e., the pointer-augmented documents can still be parsed by current XML parsers. We also implemented a mechanism to efficiently parse large documents with limited main memory, thereby overcoming a major limitation of current solutions. We study our pointer generation and parsing algorithms both theoretically and experimentally, and show that they perform considerably better than existing approaches.
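The idea of internal pointers can be illustrated with a minimal sketch. It is not the 2LP algorithm itself: the abstract does not say which W3C construct carries the pointers, so the sketch assumes a processing instruction with the hypothetical name `skip`, holding the length of the element that follows it. The helper names `add_skip_pointers` and `jump` are likewise illustrative, and the pointer generator handles only elements that do not contain a nested element of the same name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pointer syntax: <?skip N?>, where N is the length (in
# characters) of the element that immediately follows the instruction.
# A real implementation would more likely store byte offsets.
SKIP_PI = re.compile(r"<\?skip (\d+)\?>")


def add_skip_pointers(xml: str, tag: str) -> str:
    """Prefix every <tag>...</tag> element with a skip pointer.

    Toy generator: it assumes <tag> elements are not nested inside
    each other and are not written as self-closing tags.
    """
    element = re.compile(rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL)
    return element.sub(
        lambda m: f"<?skip {len(m.group(0))}?>{m.group(0)}", xml
    )


def jump(doc: str, pos: int) -> int:
    """Return the position just past the element announced at pos.

    If no skip pointer starts at pos, pos is returned unchanged and
    the caller has to scan the element in the usual way.
    """
    m = SKIP_PI.match(doc, pos)
    if m is None:
        return pos
    return m.end() + int(m.group(1))


if __name__ == "__main__":
    doc = ("<lib><book><title>A</title><body>long text</body></book>"
           "<note>n</note></lib>")
    augmented = add_skip_pointers(doc, "book")

    # Jump over the whole <book> subtree without reading its content.
    after_book = jump(augmented, len("<lib>"))
    print(augmented[after_book:])

    # The augmented document is still well-formed XML, so a standard
    # parser accepts it; ElementTree drops the instructions by default.
    print([child.tag for child in ET.fromstring(augmented)])
```

In this sketch a parser that understands the pointer moves past a subtree in one step, while a parser that does not simply treats the processing instruction as ordinary markup it can ignore, which is the backward-compatibility property the abstract describes.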
