Embedded markup based on Microdata, RDFa, and Microformats has become prevalent on the Web and constitutes an unprecedented source of data. However, RDF statements extracted from markup are fundamentally different from traditional RDF graphs: entity descriptions are flat, facts are highly redundant, and, despite very frequent co-references, explicit links are missing. Therefore, typical entity-centric tasks such as retrieval and summarisation cannot be tackled sufficiently with state-of-the-art methods and require preliminary data fusion. Given the scale and dynamics of Web markup, the applicability of general data fusion approaches is limited. We present a novel query-centric data fusion approach which overcomes such issues through a combination of entity retrieval and fusion techniques geared towards the specific challenges associated with embedded markup. To ensure precise and diverse entity descriptions, we follow a supervised learning approach: a preliminary entity retrieval step yields a pool of candidate facts relevant to a given query, and a trained classifier then selects the facts to fuse from that pool. We perform a thorough evaluation on a subset of the Web Data Commons dataset and show significant improvement over existing baselines. In addition, an investigation into the coverage and complementarity of facts from the constructed entity descriptions compared to DBpedia shows potential for aiding tasks such as knowledge base population.