Efficient social network data query processing on MapReduce

Social network data analysis has become increasingly important for business intelligence and online social services. Much social network data is represented in the Resource Description Framework (RDF), and SPARQL, an RDF query language, has accordingly become popular for analyzing it. As social networks grow rapidly, a SPARQL query usually touches a large quantity of data, so parallelizing its execution is desirable. MapReduce is a well-known and widely used tool for big data analysis. However, the state-of-the-art translation from SPARQL queries to MapReduce jobs is inefficient, because it mainly follows a two-layer scheme that first transforms each SPARQL triple pattern into a standard SQL join. In this paper, we propose two primitives that enable efficient translation from SPARQL queries to MapReduce jobs: we substitute multiple-join-with-filter for the traditional SQL multiple join when feasible, and we merge different stages in the query workflow. An evaluation on social network data benchmarks shows that translation based on these two primitives achieves up to a 2x speedup in query running time compared with the traditional two-layer scheme.
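
The abstract does not spell out how the multiple-join-with-filter primitive is defined, so the following is only a minimal sketch of the general idea it suggests: several triple patterns that share a subject are evaluated in a single map/reduce pass, with the constant constraints applied as filters, instead of as a chain of pairwise joins. The toy data, the pattern encoding, and the function names (`map_phase`, `reduce_phase`, `run_query`) are assumptions made for illustration, and the shuffle is simulated in memory rather than run on a real MapReduce cluster.

```python
from collections import defaultdict
from itertools import product

# Toy RDF triples (subject, predicate, object); made up for illustration.
TRIPLES = [
    ("alice", "type", "Person"),
    ("alice", "livesIn", "Paris"),
    ("alice", "knows", "bob"),
    ("bob", "type", "Person"),
    ("bob", "livesIn", "Rome"),
    ("bob", "knows", "carol"),
    ("carol", "type", "Person"),
    ("carol", "livesIn", "Paris"),
]

# Star-shaped pattern sharing the subject ?p:
#   ?p type Person . ?p livesIn "Paris" . ?p knows ?f
# Each entry is (predicate, object); None marks a variable object.
PATTERNS = [("type", "Person"), ("livesIn", "Paris"), ("knows", None)]


def map_phase(triples, patterns):
    """Filter triples against every pattern and key the survivors by subject."""
    for s, p, o in triples:
        for i, (pat_p, pat_o) in enumerate(patterns):
            if p == pat_p and (pat_o is None or o == pat_o):
                yield s, (i, o)


def reduce_phase(grouped, patterns):
    """Keep subjects that matched all patterns; emit their variable bindings."""
    var_idx = [i for i, (_, pat_o) in enumerate(patterns) if pat_o is None]
    for s, matches in grouped.items():
        by_pattern = defaultdict(list)
        for i, o in matches:
            by_pattern[i].append(o)
        if len(by_pattern) < len(patterns):
            continue  # at least one pattern had no match for this subject
        for combo in product(*(by_pattern[i] for i in var_idx)):
            yield s, combo


def run_query(triples, patterns):
    """Simulate one map -> shuffle -> reduce pass in memory."""
    grouped = defaultdict(list)  # stands in for the shuffle step
    for key, value in map_phase(triples, patterns):
        grouped[key].append(value)
    return sorted(reduce_phase(grouped, patterns))


if __name__ == "__main__":
    print(run_query(TRIPLES, PATTERNS))
```

In this sketch all three patterns are resolved in one pass keyed on the shared subject, whereas a pairwise translation would need one join per additional pattern. Whether the paper's primitive works exactly this way is not stated in the abstract.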
