Optimization of Complex SPARQL Analytical Queries

Analytical queries are crucial for many emerging Semantic Web applications, such as clinical-trial recruiting in the life sciences, that incorporate patient and drug profile data. Such queries compare aggregates over multiple groupings of data, which poses challenges in expressing and optimizing complex grouping-aggregation constraints. While these challenges have been addressed for relational models, the semi-structured nature of RDF introduces additional challenges that need further investigation. Each grouping required in an RDF analytical query maps to a graph pattern subquery, and related groupings lead to overlapping graph patterns within the same query. The resulting algebraic expressions contain large numbers of joins, groupings, and aggregations, posing significant challenges for present-day optimizers. In this paper, we propose an approach for supporting efficient and scalable RDF analytics that follows the well-known technique of simplifying the algebraic expressions of RDF analytical queries in a way that enables better optimization. Specifically, the approach refactors analytical queries expressed in the relational-like SPARQL algebra using a new set of logical operators. This refactoring achieves shared execution of common subexpressions and enables parallel evaluation of groupings as well as aggregations, leading to reduced I/O and processing costs, which is particularly beneficial for scale-out processing on distributed Cloud systems. Experiments on real-world and synthetic benchmarks confirm that such a rewriting can achieve up to 10X speedup over relational-style SPARQL query plans executed on popular Cloud systems.
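To make the idea of shared execution concrete, the following is a minimal Python sketch and not the operators proposed in the paper. It assumes a toy triple set, two hypothetical predicates (`takesDrug`, `livesIn`), and at most one drug and one state per patient. Two related groupings (patients per drug, and patients per drug and state) overlap on the same graph pattern, so the pattern is matched once and both aggregates are updated in a single pass instead of evaluating one subquery per grouping.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("p1", "takesDrug", "aspirin"), ("p1", "livesIn", "NC"),
    ("p2", "takesDrug", "aspirin"), ("p2", "livesIn", "NC"),
    ("p3", "takesDrug", "aspirin"), ("p3", "livesIn", "VA"),
    ("p4", "takesDrug", "statin"), ("p4", "livesIn", "VA"),
]


def match_common_pattern(triples):
    """Evaluate the overlapping pattern { ?p takesDrug ?d . ?p livesIn ?s } once.

    Simplifying assumption: each patient has at most one drug and one state.
    """
    drug, state = {}, {}
    for subj, pred, obj in triples:
        if pred == "takesDrug":
            drug[subj] = obj
        elif pred == "livesIn":
            state[subj] = obj
    # Join the two triple patterns on the shared subject variable ?p.
    return [(p, drug[p], state[p]) for p in sorted(drug) if p in state]


def shared_aggregation(triples):
    """Compute both groupings from one evaluation of the common pattern."""
    per_drug = defaultdict(int)        # GROUP BY ?d
    per_drug_state = defaultdict(int)  # GROUP BY ?d ?s
    for _, d, s in match_common_pattern(triples):
        per_drug[d] += 1
        per_drug_state[(d, s)] += 1
    return dict(per_drug), dict(per_drug_state)


def share_per_state(triples):
    """Compare the two aggregates: fraction of a drug's patients in each state."""
    per_drug, per_drug_state = shared_aggregation(triples)
    return {key: n / per_drug[key[0]] for key, n in per_drug_state.items()}


if __name__ == "__main__":
    print(share_per_state(TRIPLES))
```

A plan that evaluates each grouping as its own subquery would scan and join the same triples twice before comparing the aggregates; the sketch only illustrates why factoring out the common subexpression reduces that repeated work.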
