Learning-based SPARQL query performance modeling and prediction

One of the challenges of managing an RDF database is predicting the performance of SPARQL queries before they are executed. Performance characteristics, such as execution time and memory usage, can help data consumers identify unexpectedly long-running queries before they start and estimate the system workload for query scheduling. Extensive work addresses this performance prediction problem for traditional SQL queries, but those techniques are not directly applicable to SPARQL queries. In this paper, we adopt machine learning techniques to predict the performance of SPARQL queries. Our work focuses on modeling the features of a SPARQL query as a vector representation. Our feature modeling method does not depend on knowledge of the underlying system or the structure of the underlying data, but only on the nature of the SPARQL queries themselves. We then use these features to train prediction models. We propose a two-step prediction process and consider performance in both cold and warm stages. Evaluations are performed on real-world SPARQL queries whose execution times range from milliseconds to hours. The results demonstrate that the proposed approach can effectively predict SPARQL query performance and outperforms state-of-the-art approaches.
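
As a rough illustration of mapping a query to a vector and predicting from it, the sketch below counts a few SPARQL keywords and distinct variables in the query text, then averages the execution times of the nearest training queries. The feature set, the function names `query_to_vector` and `knn_predict`, and the use of plain k-nearest-neighbor regression are assumptions made for this example only; the abstract does not specify the features or the models used in the paper.

```python
import math
import re

# Hypothetical feature set (an assumption for illustration): counts of a few
# SPARQL keywords plus the number of distinct variables in the query text.
KEYWORD_PATTERNS = {
    "SELECT": r"\bSELECT\b",
    "OPTIONAL": r"\bOPTIONAL\b",
    "UNION": r"\bUNION\b",
    "FILTER": r"\bFILTER\b",
    "ORDER BY": r"\bORDER\s+BY\b",
    "LIMIT": r"\bLIMIT\b",
}


def query_to_vector(query):
    """Map a SPARQL query string to a numeric vector using only its text."""
    counts = [
        float(len(re.findall(pattern, query, flags=re.IGNORECASE)))
        for pattern in KEYWORD_PATTERNS.values()
    ]
    # Distinct variables such as ?s or ?name.
    distinct_vars = float(len(set(re.findall(r"\?\w+", query))))
    return counts + [distinct_vars]


def knn_predict(train_vectors, train_times, vector, k=3):
    """Predict execution time as the mean over the k nearest training queries.

    Plain k-NN regression is a stand-in here, not the paper's stated model.
    """
    ranked = sorted(
        zip(train_vectors, train_times),
        key=lambda pair: math.dist(pair[0], vector),
    )
    nearest_times = [time for _, time in ranked[:k]]
    return sum(nearest_times) / len(nearest_times)
```

Because the vector is computed from the query text alone, a sketch like this needs no statistics from the underlying store, which matches the data-independence described above. A real feature set would likely be richer than keyword counts.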
