DistLODStats: Distributed Computation of RDF Dataset Statistics

Over the last few years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without a priori statistical information about its internal structure and coverage. A number of tools already offer such statistics, providing basic information about RDF datasets and vocabularies. However, these tools usually show severe performance deficiencies once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software component for computing statistics over large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. Preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up. The component is extensible beyond the 32 default criteria, is integrated into the larger SANSA framework, and is employed in at least four major usage scenarios beyond the SANSA community.
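
To make the idea of a distributed statistical criterion concrete, the following is a minimal sketch in plain Python, not the DistLODStats implementation and not Spark code. It assumes that a criterion can be expressed as a per-partition counting step followed by a merge of partial results; the toy triples, the two example criteria, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce

# Toy triples (subject, predicate, object); all data is made up for illustration.
TRIPLES = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:acme", "rdf:type", "foaf:Organization"),
]


def partition(items, n):
    """Split items into n chunks, standing in for partitions on a cluster."""
    return [items[i::n] for i in range(n)]


def class_usage_local(chunk):
    """Per-partition step: keep rdf:type triples and count each class (the object)."""
    return Counter(o for _, p, o in chunk if p == "rdf:type")


def property_usage_local(chunk):
    """Per-partition step: count how often each predicate occurs."""
    return Counter(p for _, p, _ in chunk)


def compute(criterion, triples, n_partitions=2):
    """Apply a criterion to every partition, then merge the partial counts."""
    partials = [criterion(chunk) for chunk in partition(triples, n_partitions)]
    return reduce(lambda a, b: a + b, partials, Counter())


class_usage = compute(class_usage_local, TRIPLES)
property_usage = compute(property_usage_local, TRIPLES)
```

Because merging counters is associative and commutative, the result does not depend on how the triples are partitioned. That property is what allows this kind of computation to be spread across machines in principle; how the actual component schedules and executes its criteria on Apache Spark is not shown here.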
