Collective context-aware topic models for entity disambiguation

A crucial step in adding structure to unstructured data is to identify references to entities and disambiguate them. Such disambiguated references can help enhance readability and draw similarities across different pieces of running text in an automated fashion. Previous research has tackled this problem by first forming a catalog of entities from a knowledge base, such as Wikipedia, and then using this catalog to disambiguate references in unseen text. However, most previously proposed models either do not use all the text in the knowledge base, potentially missing discriminative features, or do not exploit word-entity proximity to learn high-quality catalogs. In this work, we propose topic models that keep track of the context of every word in the knowledge base, so that words appearing within the same context as an entity are more likely to be associated with that entity. Thus, our topic models utilize all text present in the knowledge base and help learn high-quality catalogs. Our models also learn groups of co-occurring entities, thus enabling collective disambiguation. Unlike most previous topic models, our models are non-parametric and do not require the user to specify the exact number of groups present in the knowledge base. In experiments performed on an extract of Wikipedia containing almost 60,000 references, our models outperform SVM-based baselines by as much as 18% in terms of disambiguation accuracy, translating to an increment of almost 11,000 correctly disambiguated references.
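The word-entity proximity idea can be illustrated with a deliberately simplified sketch. This is not the non-parametric topic model proposed in the paper: it swaps in plain windowed co-occurrence counts and a smoothed unigram score, and it ignores collective disambiguation entirely. The function names, the `window` and `alpha` parameters, the token representation, and the toy entity labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_catalog(annotated_docs, window=2):
    """Count words that occur within `window` tokens of each annotated entity.

    Each document is a list of tokens. A plain word is a str; an annotated
    reference is a (surface_form, entity_id) tuple. (Hypothetical format.)
    """
    catalog = defaultdict(Counter)
    for tokens in annotated_docs:
        for i, tok in enumerate(tokens):
            if not isinstance(tok, tuple):
                continue
            _, entity = tok
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Only plain words near the reference contribute to the entity.
                if j != i and isinstance(tokens[j], str):
                    catalog[entity][tokens[j].lower()] += 1
    return catalog


def disambiguate(context_words, candidates, catalog, vocab_size, alpha=1.0):
    """Pick the candidate whose catalog best explains the context words.

    Uses an add-alpha smoothed unigram log-likelihood as the score.
    """
    def score(entity):
        counts = catalog[entity]
        total = sum(counts.values())
        return sum(
            math.log((counts[w.lower()] + alpha) / (total + alpha * vocab_size))
            for w in context_words
        )
    return max(candidates, key=score)


# Toy knowledge base with two senses of "jaguar" (made-up data).
docs = [
    ["the", ("Jaguar", "Jaguar_Cars"), "sedan", "engine"],
    ["a", "wild", ("jaguar", "Jaguar_(animal)"), "hunts", "prey"],
]
catalog = build_catalog(docs, window=2)
vocab = {w for counts in catalog.values() for w in counts}
candidates = ["Jaguar_Cars", "Jaguar_(animal)"]

car_sense = disambiguate(["engine", "sedan"], candidates, catalog, len(vocab))
animal_sense = disambiguate(["wild", "prey"], candidates, catalog, len(vocab))
```

In this toy setup, words are credited to an entity only when they fall inside its window, which is the intuition behind exploiting proximity rather than treating a whole page as a single bag of words.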
