Discovering Correlated Entities from News Archives

Most textual documents contain references to real-world entities such as people, locations, and organizations. Understanding how these entities correlate underlies many applications, including social relationship construction platforms and major search engines. This paper aims to discover entity correlations from news archives using the proposed hierarchical Entity Topic Model (hETM). hETM is a semantic analysis model that follows the approach of probabilistic topic models and uses a directed acyclic graph (DAG) to capture arbitrary topic correlations. Entity extraction is performed as a preprocessing step, after which the model employs separate generative processes for ordinary words and for entities. Entity correlations are then discovered by analyzing the dependencies between entities and their associated topics, together with the correlations among topics. We evaluate the approach on a BBC news dataset, and the results show that the discovered entity correlations are of higher quality than those found by existing methods.
