SPIDER: a system for scalable, parallel / distributed evaluation of large-scale RDF data

RDF is a data model for representing labeled directed graphs and an important building block of the Semantic Web. Owing to its flexibility and applicability, RDF is used in applications such as the Semantic Web, bioinformatics, and social networks, where large-scale graph datasets are very common. However, existing techniques do not manage such datasets effectively. In this paper, we present SPIDER, a scalable, efficient query processing system for RDF data based on Hadoop, the well-known parallel/distributed computing framework. SPIDER consists of two major modules: (1) the graph data loader and (2) the graph query processor. The loader analyzes and dissects the RDF data and places the parts across multiple servers. The query processor parses the user query and distributes subqueries to the cluster nodes; it then gathers the subquery results from the servers, refines them if necessary, and delivers them to the user. Both modules utilize the MapReduce framework of Hadoop. In addition, our system supports some features of the SPARQL query language. This prototype will serve as a foundation for developing real applications with large-scale RDF graph data.
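
The loader and query-processor roles described above can be illustrated with a minimal single-process sketch. It is not SPIDER's implementation: the abstract does not say how the data is dissected, so hashing on the subject is an assumption, and the names `partition_triples`, `match_pattern`, `run_query`, and `NUM_SERVERS` are hypothetical. Plain Python dictionaries stand in for the servers, and no Hadoop or MapReduce API is used.

```python
from collections import defaultdict
from zlib import crc32

NUM_SERVERS = 3  # hypothetical cluster size, chosen only for illustration


def partition_triples(triples, num_servers=NUM_SERVERS):
    """Loader sketch: assign each triple to a server by hashing its subject.

    Subject hashing is an assumed placement rule, not the one used by SPIDER.
    """
    servers = defaultdict(list)
    for s, p, o in triples:
        server_id = crc32(s.encode("utf-8")) % num_servers
        servers[server_id].append((s, p, o))
    return servers


def match_pattern(triples, pattern):
    """Subquery sketch: return triples matching a pattern (None is a wildcard)."""
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


def run_query(servers, pattern):
    """Query-processor sketch: send the pattern to every server, then gather."""
    gathered = []
    for server_id in sorted(servers):
        gathered.extend(match_pattern(servers[server_id], pattern))
    # "Refine" step: drop duplicates and return results in a stable order.
    return sorted(set(gathered))


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]
servers = partition_triples(triples)
print(run_query(servers, (None, "knows", None)))
```

In a MapReduce setting, the per-server matching would roughly correspond to a map phase and the gathering and refining to a reduce phase; the sketch only mirrors that division of work.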
