Towards scalable RDF graph analytics on MapReduce

In order to exploit the growing amount of RDF data in decision-making, there is an increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs namely pattern matching, grouping and aggregation, and require several join operations to reassemble them into n-ary relations relevant to the given query, unlike traditional OLAP systems where data is suitably organized. MapReduce-based parallel processing systems like Pig have gained success in processing scalable analytical workloads. However, these systems offer only relational algebra style operators which would require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impacts performance. In this paper, we propose UDFs that (i) re-factor analytical processing on RDF graphs in a way that enables more parallelized processing (ii) perform a look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.

[1]  Jane Hunter,et al.  A scale-out RDF molecule store for distributed processing of biomedical data , 2008, WWW 2008.

[2]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[3]  Padmashree Ravindra,et al.  RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web , 2009, SEMWEB.

[4]  Gerhard Weikum,et al.  Scalable join processing on very large RDF graphs , 2009, SIGMOD Conference.

[5]  Jane Hunter,et al.  Scalable Semantics – The Silver Lining of Cloud Computing , 2008, 2008 IEEE Fourth International Conference on eScience.

[6]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[7]  Daniel J. Abadi,et al.  Scalable Semantic Web Data Management Using Vertical Partitioning , 2007, VLDB.

[8]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[9]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[10]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[11]  Abraham Bernstein,et al.  Hexastore: sextuple indexing for semantic web data management , 2008, Proc. VLDB Endow..

[12]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[13]  Rob Pike,et al.  Interpreting the data: Parallel analysis with Sawzall , 2005, Sci. Program..

[14]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[15]  E. Prud hommeaux,et al.  SPARQL query language for RDF , 2011 .

[16]  Frank van Harmelen,et al.  Scalable Distributed Reasoning Using MapReduce , 2009, SEMWEB.

[17]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..