Tough Tables: Carefully Evaluating Entity Linking for Tabular Data

Table annotation is a key task for improving Web querying and for supporting Knowledge Graph population from legacy sources (tables). Last year, the SemTab challenge was introduced to unify different efforts to evaluate table annotation algorithms by providing a common interface and several general-purpose datasets as ground truth. The SemTab dataset is useful for gaining a general understanding of how these algorithms work, and the organizers of the challenge added some artificial noise to the data to make the annotation trickier. However, it is hard to use this dataset to analyze specific aspects automatically. For example, the ambiguity of names at the entity level can largely affect the quality of the annotation. In this paper, we propose a novel dataset to complement the datasets proposed by SemTab. The dataset consists of a set of high-quality, manually curated tables with non-obviously linkable cells, i.e., cells whose values are ambiguous names, typos, and misspelled entity names that do not appear in the current version of the SemTab dataset. These challenges are particularly relevant for the ingestion of structured legacy sources into existing knowledge graphs. Evaluations run on this dataset show that ambiguity is a key problem for entity linking algorithms and point to a promising direction for future work in the field.
