SPARQL in the cloud using Rya

SPARQL is the standard query language for Resource Description Framework (RDF) data. RDF was originally designed for describing metadata on the Internet. Although RDF datasets keep growing in both number and size, most of today's best RDF storage solutions are confined to a single node, which limits scalability given the magnitude of modern-day data. In this paper we introduce Rya, a scalable RDF data management system that efficiently supports SPARQL queries. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Highlights
- We build a scalable RDF data management system in a cloud environment.
- Rya is based on the Accumulo columnar store and the OpenRDF Sesame framework.
- We use three index tables, SPO, POS, and OSP, with the triple data stored in the row ID.
- We implement performance enhancements that scale to billions of triples, with millisecond query times for most queries.
- Rya provides fast and easy access to the data through SPARQL.
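The three-table layout named in the highlights (SPO, POS, OSP, with the whole triple stored in the row ID) can be illustrated with a small in-memory sketch. This is not Rya's code and does not use Accumulo: sorted Python lists stand in for the sorted tables, and the separator byte, the function names (`build_indexes`, `choose_table`, `scan`), and the rule of picking the table with the longest bound prefix are all assumptions made for illustration. The point it shows is that every triple pattern with at least one bound term becomes a prefix range scan on one of the three tables.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator between terms; the abstract does not specify one

# Component order (indices into (subject, predicate, object)) for each table.
LAYOUTS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


def build_indexes(triples):
    """Return one sorted list of row IDs per table; each row ID holds a whole triple."""
    return {
        name: sorted(SEP.join(t[i] for i in order) for t in triples)
        for name, order in LAYOUTS.items()
    }


def bound_prefix(pattern, order):
    """Collect the leading bound terms of a pattern in a table's order (None = variable)."""
    prefix = []
    for i in order:
        if pattern[i] is None:
            break
        prefix.append(pattern[i])
    return prefix


def choose_table(pattern):
    """Pick the table whose row ID starts with the most bound terms (ties go to SPO)."""
    return max(LAYOUTS, key=lambda name: len(bound_prefix(pattern, LAYOUTS[name])))


def scan(indexes, pattern):
    """Answer a single triple pattern with a prefix range scan on the chosen table."""
    name = choose_table(pattern)
    order = LAYOUTS[name]
    parts = bound_prefix(pattern, order)
    prefix = SEP.join(parts)
    if parts and len(parts) < 3:
        prefix += SEP  # so "al" does not match rows that start with "alice"
    rows = indexes[name]
    results = []
    for row in rows[bisect_left(rows, prefix):]:
        if not row.startswith(prefix):
            break  # rows are sorted, so the matching range has ended
        terms = row.split(SEP)
        triple = [None, None, None]
        for pos, i in enumerate(order):
            triple[i] = terms[pos]  # map table order back to (s, p, o)
        results.append(tuple(triple))
    return results


triples = [
    ("alice", "knows", "bob"),
    ("alice", "likes", "carol"),
    ("bob", "knows", "carol"),
]
indexes = build_indexes(triples)
knows = scan(indexes, (None, "knows", None))   # served by the POS table
alice_bob = scan(indexes, ("alice", None, "bob"))  # served by the OSP table
```

With these three orderings, each of the seven possible combinations of bound subject, predicate, and object is a prefix of some table's row ID, so no pattern needs a filter after the scan; a fully unbound pattern degenerates to a full scan of SPO. The sketch assumes terms never contain the separator character.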
