Knowledge-Graph-Based Semantic Labeling of Tabular Data

A large amount of data is published on the Web in tabular formats (e.g., spreadsheets). This is especially true of the data made available in open data portals by public and private institutions. However, one of the main obstacles to their effective (re)use is a general lack of semantics: column names are usually not standardized, and their meaning and content are not always clear. In parallel, some data providers have widely adopted knowledge graphs as a means to publish large amounts of structured data. These commonly use graph-based formats (e.g., RDF) and refer to lightweight ontologies. It is well understood that the reuse of tabular data can be improved by annotating them with the classes and properties used by the data available in knowledge graphs. Semantic labeling poses several challenges, such as common or duplicated entity names, differences in measurements and rounding errors in numeric values, and noise in both the published tabular data and the knowledge graphs. In this work, we present a novel approach to automatically label columns in tabular data with ontology classes and properties referred to by existing knowledge graphs. We evaluated the performance of our approach on entity columns and numeric columns separately. For the entity columns, we applied our approach to annotated tables from the T2D gold standard. For the numeric columns, we manually annotated numeric columns in the T2D gold standard and then applied our technique to these data. We report the performance of our approach using precision, recall, and F1 scores, which are the conventional measures for reporting semantic labeling performance in the literature. The experiments showed that our proposed approach successfully labeled the majority of the entity and numeric columns in the dataset used.
In contrast with other existing proposals in the state of the art, our approach does not require external linguistic resources, other sources of information, or a human in the loop.
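
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how precision, recall, and F1 could be computed for column labeling. It assumes that a predicted label counts as correct only when it exactly matches the gold annotation for that column, and that columns the system leaves unlabeled are simply absent from the predictions; the exact matching criterion used in the evaluation is not stated here. The function name, column identifiers, and labels are hypothetical and chosen for illustration only.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 for column labels.

    predicted, gold: dicts mapping a column id to a class/property label.
    Assumption: a prediction is correct only on an exact label match.
    """
    # Count predictions whose label equals the gold label of the same column.
    correct = sum(1 for col, label in predicted.items() if gold.get(col) == label)
    # Precision: share of emitted labels that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of gold-annotated columns that were labeled correctly.
    recall = correct / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold annotations and system output (not taken from T2D).
gold = {
    "t1.c0": "dbo:City",
    "t2.c0": "dbo:Film",
    "t3.c1": "dbo:populationTotal",
    "t4.c0": "dbo:Country",
}
predicted = {
    "t1.c0": "dbo:City",
    "t2.c0": "dbo:Film",
    "t3.c1": "dbo:areaTotal",  # wrong property for this numeric column
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy example, two of the three emitted labels are correct and two of the four gold columns are recovered, so leaving a column unlabeled lowers recall without affecting precision.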