Nesting Strategies for Enabling Nimble MapReduce Dataflows for Large RDF Data

Graph and semi-structured data are usually modeled in relational processing frameworks as "thin" relations (node, edge, node), and processing such data involves many join operations. Intermediate results of joins involving multi-valued attributes or relationships contain redundant subtuples, because the single-valued attributes are repeated for every value of a multi-valued one. The amount of redundant content is high for real-world multi-valued relationships in social network datasets (e.g., the millions of Twitter followers of popular celebrities) or biological datasets (e.g., multiple references to related proteins). In MapReduce-based platforms such as Apache Hive and Pig, redundancy in intermediate results adds avoidable cost to the overall I/O, sorting, and network transfer overhead of join-intensive workloads, which involve longer workflows. Consequently, techniques for dealing with such redundancy will enable more nimble execution of these workflows. This paper argues for a nested data model that represents intermediate data concisely, together with nesting-aware dataflow operators that allow for lazy and partial unnesting strategies. This approach reduces the overall I/O and network footprint of a workflow by keeping intermediate results in concise form during most of the workflow's execution, until complete unnesting is absolutely necessary. The proposed strategies are integrated into Apache Pig, and experimental evaluation over real-world and synthetic benchmark datasets confirms their superiority over relational-style MapReduce systems such as Apache Pig and Hive.
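The redundancy argument above can be illustrated with a minimal Python sketch. It is not the paper's operator set or any Apache Pig API; the triples and the helper names (`nest`, `unnest`, `partial_unnest`, `field_count_nested`) are hypothetical and chosen only for illustration. The sketch compares how many fields a flat, fully unnested join result carries against a nested representation of the same data, and shows a partial unnest that expands only one property.

```python
from itertools import product

# Hypothetical "thin" (subject, property, object) triples; one multi-valued property.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "city", "Raleigh"),
    ("alice", "follower", "bob"),
    ("alice", "follower", "carol"),
    ("alice", "follower", "dave"),
]

def nest(triples):
    """Group triples by subject, keeping each property's values in a list."""
    groups = {}
    for s, p, o in triples:
        groups.setdefault(s, {}).setdefault(p, []).append(o)
    return groups

def unnest(subject, props):
    """Fully unnest one group: one flat tuple per combination of property values."""
    names = sorted(props)  # fixed column order: city, follower, name
    for combo in product(*(props[n] for n in names)):
        yield (subject,) + combo

def partial_unnest(subject, props, prop):
    """Expand only `prop`; all other properties stay nested."""
    rest = {k: v for k, v in props.items() if k != prop}
    for value in props[prop]:
        yield subject, value, rest

def field_count_nested(groups):
    """Fields stored when each subject and each property value appears once."""
    return sum(1 + sum(len(v) for v in props.values()) for props in groups.values())

nested = nest(triples)
flat_rows = [row for s, props in nested.items() for row in unnest(s, props)]
flat_fields = sum(len(row) for row in flat_rows)
nested_fields = field_count_nested(nested)
```

In this toy input the flat form repeats the single-valued `name` and `city` once per follower, so its field count grows with the size of the multi-valued property, while the nested form stores each value once. Deferring the call to `unnest` until a later operator actually needs flat tuples is the lazy strategy in miniature.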
