Pay-as-you-go Approximate Join Top-k Processing for the Web of Data

For effective search over the Web of data, ranking of results is crucial. Top-k processing strategies have been proposed to process such ranked queries efficiently: they aim to compute the k top-ranked results without materializing the complete result set. However, for many applications, result computation time is much more important than result accuracy and completeness. Thus, there is a strong need for approximate ranked results. Unfortunately, previous work on approximate top-k processing is not well suited to the Web of data. In this paper, we propose the first approximate top-k join framework for Web data and queries. Our approach is very lightweight: the necessary statistics are learned at runtime in a pay-as-you-go manner. We conducted extensive experiments on state-of-the-art SPARQL benchmarks. The results are very promising: we achieved time savings of up to 65% while maintaining high precision and recall.
