Distributed Tree-Pattern Matching in Big Data Analytics Systems

Big data analytics systems such as Apache Spark offer built-in support for nested data, which is abundant, for instance, in the form of JSON data available online. However, these systems typically have to transform the data before (deeply) nested values become accessible for further processing. This adds complexity to big data analytics pipelines and may incur unnecessary runtime overhead. This paper therefore introduces tree-pattern matching as a first-class operator in big data analytics systems. The operator reduces pipeline complexity and accelerates pipeline processing by up to a factor of four compared to state-of-the-art pipelines for nested data. Its novelty lies in the distributed, data-parallel processing supported by the underlying tree-pattern matching algorithm. Experiments with an implementation in Spark validate that the operator reduces both pipeline complexity and runtime.
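To make the idea of matching a tree pattern against nested data concrete, the following is a minimal single-machine sketch in Python. It is not the distributed algorithm or the Spark operator described in this paper. The function name `match`, the dictionary-based pattern encoding, and the sample record are assumptions made purely for illustration. The sketch supports parent-child edges only and descends into arrays implicitly, which is the step that would otherwise require an explicit flattening transformation in a pipeline.

```python
from typing import Any, Iterator, Optional


def match(node: Any, pattern: dict[str, Optional[dict]]) -> Iterator[dict]:
    """Yield one binding per embedding of `pattern` into `node`.

    Hypothetical encoding: a pattern maps field names to either None
    (a leaf whose value is returned) or a nested sub-pattern.
    Only parent-child edges are supported in this sketch.
    """
    if isinstance(node, list):
        # Arrays are traversed implicitly: each element is a candidate.
        for item in node:
            yield from match(item, pattern)
        return
    if not isinstance(node, dict):
        return

    partial = [{}]  # bindings collected so far for this node
    for key, sub in pattern.items():
        if key not in node:
            return  # a required pattern branch is missing
        child = node[key]
        if sub is None:
            # Leaf: bind each value (array elements bind one at a time).
            values = child if isinstance(child, list) else [child]
            branch = [{key: v} for v in values]
        else:
            # Inner pattern node: recurse and prefix the bound paths.
            branch = [
                {f"{key}.{k}": v for k, v in b.items()}
                for b in match(child, sub)
            ]
        # Combine this branch with the bindings of earlier branches.
        partial = [{**p, **b} for p in partial for b in branch]
        if not partial:
            return
    yield from partial


# Illustrative nested record and pattern (both made up for this sketch).
record = {
    "user": {"name": "ann"},
    "tweets": [
        {"text": "a", "tags": ["x", "y"]},
        {"text": "b", "tags": []},
    ],
}
pattern = {"user": {"name": None}, "tweets": {"text": None, "tags": None}}
matches = list(match(record, pattern))
for m in matches:
    print(m)
```

In this sketch the second tweet yields no match because its `tags` array is empty, so the pattern branch `tweets.tags` cannot be bound. A distributed variant would have to partition such work across records or subtrees, which this example does not attempt.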
