FuseM: Query-Centric Data Fusion on Structured Web Markup

Embedded markup based on Microdata, RDFa, and Microformats have become prevalent on the Web and constitute an unprecedented source of data. However, RDF statements extracted from markup are fundamentally different from traditional RDF graphs: entity descriptions are flat, facts are highly redundant, and despite very frequent co-references explicit links are missing. Therefore, carrying out typical entity-centric tasks such as retrieval and summarisation cannot be tackled sufficiently with state-of-the-art methods and require preliminary data fusion. Given the scale and dynamics of Web markup, the applicability of general data fusion approaches is limited. We present a novel query-centric data fusion approach which overcomes such issues through a combination of entity retrieval and fusion techniques geared towards the specific challenges associated with embedded markup. To ensure precise and diverse entity descriptions, we follow a supervised learning approach and train a classifier for data fusion of a pool of candidate facts relevant to a given query and obtained through a preliminary entity retrieval step. We perform a thorough evaluation on a subset of the Web Data Commons dataset and show significant improvement over existing baselines. In addition, an investigation into the coverage and complementarity of facts from the constructed entity descriptions compared to DBpedia, shows potential for aiding tasks such as knowledge base population.