Parsing XML using parallel traversal of streaming trees

XML has been adopted across a wide spectrum of applications. Its parsing efficiency, however, remains a concern and can be a bottleneck. With the current trend towards multicore CPUs, parallelization to improve performance is increasingly relevant. In many applications, the XML is streamed from the network, and thus the complete XML document is never in memory at any single moment. Parallel parsing of such a stream can be equated to parallel depth-first traversal of a streaming tree. Existing research on parallel tree traversal has assumed that the entire tree is available in memory, and thus cannot be directly applied. In this paper we investigate parallel, SAX-style parsing of XML via a parallel, depth-first traversal of the streaming document. We show good scalability up to about 6 cores on a Linux platform.
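
To make the correspondence between SAX-style parsing and depth-first traversal concrete, the following is a minimal sequential sketch in Python. It is not the parallel algorithm investigated in the paper. It only shows that, when a document arrives in chunks, the start and end events of an incremental SAX parser visit the element tree in depth-first order without the whole document being held in memory. The names DepthFirstLogger and parse_stream are hypothetical, chosen for illustration, and the chunk list stands in for data read from a network socket.

```python
import xml.sax


class DepthFirstLogger(xml.sax.ContentHandler):
    """Hypothetical handler: records each SAX event with the tree depth at which it fires."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def startElement(self, name, attrs):
        # Entering a node: this is the pre-order visit of a depth-first traversal.
        self.events.append(("start", name, self.depth))
        self.depth += 1

    def endElement(self, name):
        # Leaving a node: its whole subtree has been visited.
        self.depth -= 1
        self.events.append(("end", name, self.depth))


def parse_stream(chunks):
    """Feed byte chunks to an incremental SAX parser as they arrive."""
    handler = DepthFirstLogger()
    parser = xml.sax.make_parser()  # default expat-based reader, supports feed()/close()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)  # chunk boundaries may fall in the middle of a tag
    parser.close()
    return handler.events


if __name__ == "__main__":
    # Assumed stand-in for a network stream: one small document split into three chunks.
    for event in parse_stream([b"<a><b><c/", b"></b><d/>", b"</a>"]):
        print(event)
```

A parallel version would additionally have to decide how threads share the not-yet-arrived parts of the tree, which is the problem the paper addresses and which this sketch leaves out.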
