Querying and managing Semantic Web data and scientific workflow provenance using relational databases

Explosive growth of RDF data on the Semantic Web drives the need for novel database techniques that can efficiently store and query large RDF datasets. To achieve good query-processing performance and scalability, most existing RDF stores use a relational database management system as a backend to manage RDF data. The main challenge of this approach is translating RDF queries, expressed in the SPARQL query language, into equivalent relational algebra expressions and SQL queries, so that the latter can be further optimized and evaluated by the relational query engine and their results returned as SPARQL query solutions. Although much work has been done on translating SPARQL queries to SQL queries, existing translation procedures are (i) not based on a formal semantics and thus not provably semantics-preserving, (ii) not efficient in processing nested optional graph patterns, and (iii) not optimized for querying scientific workflow provenance. In this dissertation, we propose the first provably semantics-preserving SPARQL-to-SQL translation algorithm and develop RDFProv, a relational RDF store for querying and managing Semantic Web data and scientific workflow provenance. Our main research contributions are as follows: (i) we formalize a relational-algebra-based semantics of SPARQL and prove its equivalence to the mapping-based semantics of SPARQL; (ii) we define the first provably semantics-preserving, generic SPARQL-to-SQL translation in the literature, with support for SPARQL triple patterns, basic graph patterns, optional graph patterns, alternative graph patterns, and value constraints; (iii) we propose a novel relational join, the nested optional join, to efficiently evaluate SPARQL queries with well-designed graph patterns and nested optional patterns; and (iv) we design RDFProv, the first relational RDF store optimized for storing and querying scientific workflow provenance as part of the Semantic Web of Scientific Workflow Provenance.
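
To make the translation problem concrete, the following is a minimal illustrative sketch, not the dissertation's algorithm and not RDFProv's actual schema. It assumes a hypothetical single table triples(s, p, o), hypothetical helper names (triple_pattern_sql, optional_sql), and variable names that are valid SQL identifiers. A triple pattern becomes a SELECT over the table, and an optional graph pattern becomes a LEFT OUTER JOIN on the shared variables. The sketch further assumes that shared variables are always bound on the left side; a semantics-preserving translation must also handle unbound (NULL) values, which is omitted here.

```python
import sqlite3

# Assumption: all triples live in one hypothetical table triples(s, p, o).

def triple_pattern_sql(pattern):
    """Translate one triple pattern into (sql, params, variables).

    Terms starting with '?' are variables; all other terms are constants.
    Assumes the pattern contains at least one variable.
    """
    select, where, params, seen = [], [], [], {}
    for col, term in zip(("s", "p", "o"), pattern):
        if term.startswith("?"):
            var = term[1:]
            if var in seen:
                # The same variable occurs twice: both columns must agree.
                where.append(f"{col} = {seen[var]}")
            else:
                seen[var] = col
                select.append(f"{col} AS {var}")
        else:
            where.append(f"{col} = ?")
            params.append(term)
    sql = f"SELECT {', '.join(select)} FROM triples"
    if where:
        sql += f" WHERE {' AND '.join(where)}"
    return sql, params, list(seen)

def optional_sql(left, right):
    """Translate `left OPTIONAL { right }` as a LEFT OUTER JOIN.

    Simplification: shared variables are assumed bound in `left`.
    """
    lsql, lparams, lvars = left
    rsql, rparams, rvars = right
    new_vars = [v for v in rvars if v not in lvars]
    cols = [f"l.{v} AS {v}" for v in lvars] + [f"r.{v} AS {v}" for v in new_vars]
    shared = [v for v in lvars if v in rvars]
    on = " AND ".join(f"l.{v} = r.{v}" for v in shared) or "1 = 1"
    sql = (f"SELECT {', '.join(cols)} FROM ({lsql}) AS l "
           f"LEFT OUTER JOIN ({rsql}) AS r ON {on}")
    return sql, lparams + rparams, lvars + new_vars

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("alice", "name", "Alice"),
    ("alice", "email", "alice@example.org"),
    ("bob", "name", "Bob"),
])

# SPARQL: SELECT ?x ?n ?e WHERE { ?x name ?n OPTIONAL { ?x email ?e } }
sql, params, variables = optional_sql(
    triple_pattern_sql(("?x", "name", "?n")),
    triple_pattern_sql(("?x", "email", "?e")),
)
rows = sorted(conn.execute(sql, params).fetchall(), key=lambda r: r[0])
print(variables, rows)
```

In this sketch the solution for bob keeps a NULL in the e column, which corresponds to ?e being unbound in the SPARQL solution. Nesting calls to optional_sql yields nested outer joins, the case that the nested optional join described above is designed to evaluate more efficiently.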