Parallelizing XML Processing Pipelines via MapReduce

We present approaches for exploiting data parallelism in XML processing pipelines through novel strategies for compiling them to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that consume XML-structured data and produce, often through calls to “black-box” functions, modified (i.e., updated) XML structures. Our main contributions are a set of strategies for compiling such XML pipelines into parallel MapReduce networks and a discussion of their advantages and tradeoffs. We present a detailed experimental evaluation of these approaches using the Hadoop MapReduce system as our implementation platform. Our results show that execution times of XML pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML processing pipelines.
