[Figure: PAXQuery architecture. Components: XQuery parser, logical plan optimizer, logical-to-PACT translator (PAXQuery); Flink optimizer (Flink). Intermediate artifacts: logical plan, optimized logical plan, PACT plan, optimized Flink plan.]

XQuery is a general-purpose programming language for processing semi-structured data, and as such it is very expressive. As a consequence, optimizing and parallelizing complex analytical XQuery queries remains an open and challenging problem. We demonstrate PAXQuery, a novel system that parallelizes the execution of XQuery queries over large collections of XML documents. PAXQuery compiles a rich subset of XQuery into plans expressed in the PArallelization ConTracts (PACT) programming model. The Apache Flink system then optimizes these plans and executes them in a massively parallel fashion. The result is a scalable system that queries massive amounts of XML data very efficiently, as shown by the experimental results we outline.
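To give a rough feel for the target model, the following is a minimal, sequential sketch of two PACT-style second-order operators (Map and Reduce) applied to a tiny collection of XML documents. It is not PAXQuery's translation and does not use the Flink API: the documents, the function names (`map_contract`, `reduce_contract`, `extract_city`, `count_people`, `people_per_city`), and the query itself (counting persons per city) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy documents; not taken from the PAXQuery evaluation.
DOCS = [
    "<person><name>Ann</name><city>Paris</city></person>",
    "<person><name>Bob</name><city>Lyon</city></person>",
    "<person><name>Eve</name><city>Paris</city></person>",
]


def map_contract(records, udf):
    """Map: apply udf to each record independently.

    Because records do not interact, a parallel engine could split
    them freely across workers; here we simply loop over them.
    """
    return [out for rec in records for out in udf(rec)]


def reduce_contract(pairs, udf):
    """Reduce: group (key, value) pairs by key, call udf once per group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort by key only to make the sketch's output deterministic.
    return [udf(key, values) for key, values in sorted(groups.items())]


def extract_city(doc):
    """User function for Map: parse one document, emit (city, name)."""
    root = ET.fromstring(doc)
    return [(root.findtext("city"), root.findtext("name"))]


def count_people(city, names):
    """User function for Reduce: count the persons in one city."""
    return (city, len(names))


def people_per_city(docs):
    """Chain the two operators into a small plan."""
    return reduce_contract(map_contract(docs, extract_city), count_people)


print(people_per_city(DOCS))
```

The point of the sketch is the separation of concerns: the second-order operators fix how data may be partitioned, while the user functions carry the query logic, which is what lets an engine parallelize a plan without inspecting the functions themselves.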
