Paper information - Creating voiD descriptions for Web-scale data

When working with large amounts of crawled semantic data as provided by the Billion Triple Challenge (BTC), it is desirable to present the data in a manner best suited for end users. This includes conceiving and presenting explanatory metainformation. The Vocabulary of Interlinked Datasets (voiD) has been proposed as a means to annotate sets of RDF resources to facilitate not only human understanding, but also query optimization. In this article we introduce tools that automatically generate voiD descriptions for large datasets. Our approach comprises different means to identify (sub)datasets and annotate the derived subsets according to the voiD specification. Due to the complexity of Web-scale Linked Data, all algorithms used for partitioning and augmenting are implemented in a cloud environment utilizing the MapReduce paradigm. We employed the Billion Triple Challenge 2010 dataset [6] to evaluate our approach, and present the results in this article. We have released a tool named voiDgen to the public that allows the generation of metainformation for such large datasets.
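
The partition-then-annotate idea described in the abstract can be illustrated with a small, single-machine sketch. It is not the voiDgen implementation: the partitioning rule (grouping triples by the host of the subject URI) and the function names `map_triple`, `reduce_dataset`, and `describe` are assumptions made for illustration only. The statistics it emits use the voiD property names `void:triples`, `void:distinctSubjects`, and `void:properties`.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_triple(triple):
    """Map step (hypothetical rule): key a triple by the host of its subject URI."""
    subject, predicate, obj = triple
    host = urlparse(subject).netloc
    yield host, (subject, predicate, obj)


def reduce_dataset(host, triples):
    """Reduce step: compute simple voiD-style statistics for one partition."""
    triples = list(triples)
    return {
        "dataset": host,
        "void:triples": len(triples),
        "void:distinctSubjects": len({s for s, _, _ in triples}),
        "void:properties": len({p for _, p, _ in triples}),
    }


def describe(triples):
    """Run map, group by key, and reduce in memory (stands in for a MapReduce job)."""
    groups = defaultdict(list)
    for triple in triples:
        for key, value in map_triple(triple):
            groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reduce_dataset(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    # Tiny made-up sample; a real run would read N-Quads from a crawl.
    sample = [
        ("http://dbpedia.org/resource/Berlin",
         "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
        ("http://dbpedia.org/resource/Berlin",
         "http://www.w3.org/2002/07/owl#sameAs",
         "http://sws.geonames.org/2950159/"),
        ("http://sws.geonames.org/2950159/",
         "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
    ]
    for description in describe(sample):
        print(description)
```

In an actual MapReduce setting the in-memory grouping in `describe` would be replaced by the framework's shuffle phase, with `map_triple` and `reduce_dataset` running as distributed tasks.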

[1] Jens Lehmann, et al. DBpedia - A crystallization point for the Web of Data, 2009, J. Web Semant.

[2] Michael Hausenblas, et al. Describing linked datasets with the VoID vocabulary, 2011.

[3] Jeremy J. Carroll, et al. Resource Description Framework (RDF): Concepts and Abstract Syntax, 2003.

[4] Jun Zhao, et al. Describing Linked Datasets: On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets", 2009.

[5] Patrick J. Hayes, et al. When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, 2010, LDOW.

[6] Simon Schenk, et al. Optimizing SPARQL Queries over Disparate RDF Data Sources through Distributed Semi-Joins, 2008, SEMWEB.

[7] Felix Naumann, et al. Profiling linked open data with ProLOD, 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[8] Deborah L. McGuinness, et al. SameAs Networks and Beyond: Analyzing Deployment Status and Implications of owl:sameAs in Linked Data, 2010, International Semantic Web Conference.

[9] Deborah L. McGuinness, et al. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, 2010, SEMWEB.

[10] John Feo, et al. High performance semantic factoring of giga-scale semantic graph databases, 2010.