Roomba: An Extensible Framework to Validate and Build Dataset Profiles

Linked Open Data (LOD) has emerged as one of the largest collections of interlinked datasets on the web. To benefit from this mine of data, one needs access to descriptive information about each dataset, that is, its metadata. This information can be used to delay data entropy, to enhance dataset discovery, exploration and reuse, and to help data portal administrators detect and eliminate spam. However, such metadata is currently limited to a few data portals, where it is usually provided manually and is therefore often incomplete and inconsistent in quality. To address these issues, we propose a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. This approach applies several techniques to check the validity of the provided metadata and to generate descriptive and statistical information for a particular dataset or for an entire data portal.
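
The validate, correct and generate steps described above can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the field names in `REQUIRED_FIELDS`, the `ProfileReport` structure and the `validate_and_profile` function are all hypothetical, loosely modeled on the kind of metadata record a data portal might expose.

```python
from dataclasses import dataclass, field

# Hypothetical field names (an assumption for illustration only).
REQUIRED_FIELDS = ("title", "description", "license", "maintainer_email")


@dataclass
class ProfileReport:
    missing: list = field(default_factory=list)    # required fields absent or blank
    corrected: dict = field(default_factory=dict)  # values recomputed from the data
    stats: dict = field(default_factory=dict)      # simple statistical information


def validate_and_profile(metadata: dict) -> ProfileReport:
    """Validate one metadata record, correct what can be derived, add statistics."""
    report = ProfileReport()

    # Validation: every required field must be a non-blank string.
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        if not isinstance(value, str) or not value.strip():
            report.missing.append(name)

    # Correction: the declared resource count must match the actual list.
    resources = metadata.get("resources", [])
    if metadata.get("num_resources") != len(resources):
        report.corrected["num_resources"] = len(resources)

    # Generation: count resources per (case-normalized) format.
    by_format = {}
    for resource in resources:
        fmt = str(resource.get("format") or "unknown").lower()
        by_format[fmt] = by_format.get(fmt, 0) + 1
    report.stats["resources_by_format"] = by_format

    return report
```

Running the same function over every record returned by a portal would give a portal-wide report; how records are fetched is deliberately left out of this sketch.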
