Parallelizing XML data-streaming workflows via MapReduce

In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to ''black-box'' (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance) make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.

[1]  Zhaohua Li TelegraphCQ : Continuous Dataflow Processing for an Uncertain World , 2006 .

[2]  Bertram Ludäscher,et al.  X-CSR: Dataflow Optimization for Distributed XML Process Pipelines , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[3]  Susan B. Davidson,et al.  An Efficient XPath Query Processor for XML Streams , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[4]  Natawut Nupairoj,et al.  The BPEL orchestrating framework for secured grid services , 2005, International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II.

[5]  Dan Suciu,et al.  XMLTK: An XML Toolkit for Scalable XML Stream Processing , 2002 .

[6]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[7]  Wolfgang Gentzsch,et al.  Sun Grid Engine: towards creating a compute power grid , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[8]  Jun Qin,et al.  Advanced data flow support for scientific grid workflow applications , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[9]  Stefanie Scherzinger,et al.  Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams , 2004, VLDB.

[10]  Wenfei Fan,et al.  Incremental evaluation of schema-directed XML publishing , 2004, SIGMOD '04.

[11]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[12]  Ian J. Taylor,et al.  Triana Applications within Grid Computing and Peer to Peer Environments , 2003, Journal of Grid Computing.

[13]  Gilles Kahn,et al.  The Semantics of a Simple Language for Parallel Programming , 1974, IFIP Congress.

[14]  Dennis Gannon,et al.  Workflows for e-Science, Scientific Workflows for Grids , 2014 .

[15]  Vagelis Hristidis,et al.  Authority-based keyword search in databases , 2008, TODS.

[16]  Emmanuel Beffara,et al.  ICFP '03 Proceedings of the eighth ACM SIGPLAN international conference on Functional programming , 2003 .

[17]  Gregor von Laszewski,et al.  Swift: Fast, Reliable, Loosely Coupled Parallel Computation , 2007, 2007 IEEE Congress on Services (Services 2007).

[18]  Kevin P. Hinshaw,et al.  Distributed XQuery , 2004 .

[19]  Beng Chin Ooi,et al.  Proceedings of the 2007 ACM SIGMOD international conference on Management of data , 2007, SIGMOD 2007.

[20]  Shawn Bowers,et al.  An approach for pipelining nested collections in scientific workflows , 2005, SGMD.

[21]  Marcus Fontoura,et al.  Streaming XPath processing with forward and backward axes , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[22]  David J. DeWitt,et al.  NiagaraCQ: a scalable continuous query system for Internet databases , 2000, SIGMOD '00.

[23]  Stefanie Scherzinger,et al.  FluXQuery: An Optimizing XQuery Processor for Streaming XML Data , 2004, VLDB.

[24]  Geoffrey Fox,et al.  Special Issue: Workflow in Grid Systems , 2006, Concurr. Comput. Pract. Exp..

[25]  Daniel James Goodman,et al.  Introduction and evaluation of Martlet: a scientific workflow language for abstracted parallelisation , 2007, WWW '07.

[26]  Jun Qin,et al.  ASKALON: a Grid application development and computing environment , 2005, The 6th IEEE/ACM International Workshop on Grid Computing, 2005..

[27]  Christian Mathis,et al.  Node labeling schemes for dynamic XML documents reconsidered , 2007, Data Knowl. Eng..

[28]  Scott Klasky,et al.  Scientific Process Automation and Workflow Management , 2009 .

[29]  Shiyong Lu,et al.  A MapReduce-Enabled Scientific Workflow Composition Framework , 2009, 2009 IEEE International Conference on Web Services.

[30]  Dan Suciu,et al.  Stream processing of XPath queries with predicates , 2003, SIGMOD '03.

[31]  Frederick Reiss,et al.  TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[32]  Patrick E. O'Neil,et al.  ORDPATHs: insert-friendly XML node labels , 2004, SIGMOD '04.

[33]  Jacek Sroka,et al.  A Formal Model of Dataflow Repositories , 2007, DILS.

[34]  Edward A. Lee,et al.  Implementing BPEL4WS: the architecture of a BPEL4WS implementation: Research Articles , 2006 .

[35]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..

[36]  Edward A. Lee,et al.  CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2000; 00:1–7 Prepared using cpeauth.cls [Version: 2002/09/19 v2.02] Taverna: Lessons in creating , 2022 .

[37]  Giuseppe Castagna,et al.  CDuce: an XML-centric general-purpose language , 2003, ACM SIGPLAN Notices.

[38]  Michael Stonebraker,et al.  Fault-tolerance in the Borealis distributed stream processing system , 2005, SIGMOD '05.

[39]  Edward A. Lee,et al.  Scientific workflow management and the Kepler system , 2006, Concurr. Comput. Pract. Exp..

[40]  Yong Zhao,et al.  A notation and system for expressing and executing cleanly typed workflows on messy scientific data , 2005, SGMD.

[41]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[42]  Trevor Jim,et al.  Highly distributed XQuery with DXQ , 2007, SIGMOD '07.

[43]  Zaiqing Nie,et al.  Proceedings of the Ninth International Workshop on Information Integration on the Web , 2012 .

[44]  Scott Klasky,et al.  Workflow automation for processing plasma fusion simulation data , 2007, WORKS '07.

[45]  Bertram Ludäscher,et al.  Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data , 2006, DILS.

[46]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[47]  Dan Suciu,et al.  Processing XML streams with deterministic automata and stream indexes , 2004, TODS.

[48]  Ioana Manolescu,et al.  Dynamic XML documents with distribution and replication , 2003, SIGMOD '03.

[49]  Frederick Reiss,et al.  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[50]  Ian J. Taylor,et al.  Workflows and e-Science: An overview of workflow system features and capabilities , 2009, Future Gener. Comput. Syst..

[51]  Jussi Myllymaki,et al.  Implementing a scalable XML publish/subscribe system using relational database systems , 2004, SIGMOD '04.

[52]  Jody Condit Fagan Mashing up multiple web feeds using yahoo! pipes , 2007 .

[53]  Bertram Ludäscher,et al.  Scientific workflow design for mere mortals , 2009, Future Gener. Comput. Syst..

[54]  Steve Jones,et al.  Rocks clusters , 2006, SC.

[55]  Patricia J. Teller,et al.  Proceedings of the 2008 ACM/IEEE conference on Supercomputing , 2008, HiPC 2008.