Learning to Map Wikidata Entities to Predefined Topics

Recently, much progress has been made in entity disambiguation and linking (EDL) systems. Given a piece of text, an EDL system links words and phrases to entities in a knowledge base, where each entity defines a specific concept. Although the extracted entities are informative, they are often too specific to be used directly by many applications. These applications usually require text content to be represented with a smaller set of predefined concepts, or topics, drawn from a topical taxonomy that matches their exact needs. In this study, we aim to build a system that maps Wikidata entities to such predefined topics. We explore a wide range of methods for mapping entities to topics, including GloVe similarity, Wikidata predicates, Wikipedia entity definitions, and entity-topic co-occurrences. These methods often predict entity-topic mappings that are reliable (i.e., have high precision) but tend to miss most of the mappings (i.e., have low recall). Therefore, we propose an ensemble system that effectively combines the individual methods and yields much better performance, comparable to that of human annotators.
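To illustrate why combining high-precision, low-recall mappers can raise recall, the following is a minimal sketch and not the system described in this study. The abstract does not state how the ensemble combines its components, so the weighted vote used here is an assumption, as are the entity identifiers (`E1`, `E2`, `E3`), the topic names, the lookup tables standing in for each method, and the weights and threshold.

```python
from typing import Callable, Dict, List, Set

# A mapper takes an entity id and returns the topics it predicts for it.
Mapper = Callable[[str], Set[str]]


def ensemble_map(
    entity: str,
    mappers: Dict[str, Mapper],
    weights: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Weighted vote over individual mappers (an illustrative assumption)."""
    scores: Dict[str, float] = {}
    for name, mapper in mappers.items():
        for topic in mapper(entity):
            # Each mapper that predicts a topic adds its weight to the score.
            scores[topic] = scores.get(topic, 0.0) + weights[name]
    # Keep topics whose accumulated score reaches the threshold.
    return sorted(t for t, s in scores.items() if s >= threshold)


# Hypothetical toy outputs standing in for three of the individual methods.
# Each table covers only some entities, mimicking low recall.
by_predicate = {"E1": {"basketball"}}
by_cooccurrence = {"E1": {"basketball", "sports"}, "E2": {"music"}}
by_embedding = {"E2": {"music", "movies"}}

mappers: Dict[str, Mapper] = {
    "predicate": lambda e: by_predicate.get(e, set()),
    "cooccurrence": lambda e: by_cooccurrence.get(e, set()),
    "embedding": lambda e: by_embedding.get(e, set()),
}

# Hypothetical weights, e.g. reflecting each method's estimated precision.
weights = {"predicate": 0.9, "cooccurrence": 0.6, "embedding": 0.4}

for entity in ("E1", "E2", "E3"):
    print(entity, ensemble_map(entity, mappers, weights, threshold=0.5))
```

In this toy setup no single mapper covers both `E1` and `E2`, while the combined vote produces topics for both and still drops the weakly supported `movies` prediction. A learned combiner could replace the fixed weights; that choice is outside what the abstract specifies.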
