Database Foundations for Scalable RDF Processing

As more and more data is published in RDF format, storing huge amounts of RDF data and efficiently processing queries over it is becoming increasingly important. The first part of the lecture introduces state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternative storage schemes for RDF, efficient index structures, and query optimization techniques. Because centralized RDF repositories are limited in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture highlights system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences from classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Finally, we argue that extracting knowledge from the Web is an excellent showcase for the scalable management of uncertain data, and potentially one of the biggest challenges this field has seen so far. The third part of the lecture therefore provides a close-up on current approaches and platforms that make reasoning (e.g., in the form of probabilistic inference) with uncertain RDF data scalable to billions of triples.
