Managing big RDF data in clouds: Challenges, opportunities, and solutions

Abstract The expansion of Semantic Web services and the evolution of cloud computing technologies have significantly enhanced the ability to preserve and publish information in standard open web formats, so that data can be both human-readable and machine-processable. This brings a central challenge of the current big data era: how to effectively store, retrieve, and analyze large volumes of Resource Description Framework (RDF) data. This paper presents an overview of the existing challenges, evolving opportunities, and current developments in managing big RDF data in clouds, and it offers guidance and lessons learned from research in big data management. In particular, it highlights the basic principles of RDF data management, so that researchers can understand the current state of work on RDF graphs and what it has achieved. Additionally, the paper provides comparative studies of current storage systems and query processing approaches to clarify their efficiency. It also offers a vision for long-term research directions by highlighting future challenges and opportunities in the RDF domain.
