SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous Labels

Webpages are an abundant source of textual information with manually annotated entity links, and are often used as training data for a wide variety of machine learning NLP tasks. However, manual annotations such as those found on Wikipedia are sparse, noisy, and biased towards popular entities. Existing entity linking systems deal with those issues by relying on simple statistics extracted from the data. While such statistics can effectively deal with noisy annotations, they introduce bias towards head entities and are ineffective for long-tail (i.e., unpopular) entities. In this work, we first analyze statistical properties linked to manual annotations by studying a large annotated corpus composed of all English Wikipedia webpages, in addition to all pages from the CommonCrawl containing English Wikipedia annotations. We then propose and evaluate a series of entity linking approaches, with the explicit goal of creating highly accurate (precision > 95%) and broad annotated corpora for machine learning tasks. Our results show that our best approach achieves maximal precision at usable recall levels, and outperforms both state-of-the-art entity linking systems and human annotators.
