Learning entity-centric document representations using an entity facet topic model

Abstract: Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, play a crucial role in shaping latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead, they have multiple aspects, and different documents may discuss different aspects of a given entity. Given that, we argue that, from an entity-centric point of view, a document related to multiple entities should (a) be represented differently for different entities (multiple entity-centric representations), and (b) have each entity-centric representation reflect the specific aspects of the entity discussed in the document. We pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents; (2) can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in that document; (3) does this novel representation improve algorithm performance in downstream applications; and (4) what is a reasonable number of aspects per entity? To answer these questions, we model each entity using multiple aspects (entity facets), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, one per entity, each of which is a mixture of that entity's facets. Finally, we propose a novel graphical model, the Entity Facet Topic Model (EFTM), to learn entity-centric document representations, entity facets, and latent topics.
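The two-level mixture described above (a document mixes over an entity's facets, and each facet mixes over latent topics) can be illustrated with a small generative sketch. This is not the EFTM specification: the abstract does not give the generative process, so the uniform choice of entity per word, the symmetric Dirichlet priors, the concentration value 0.5, and all names (`generate_document`, `facet_topics`, `topic_words`, `entity_a`, `entity_b`) are assumptions made purely for illustration.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw a probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Sample an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(entities, n_words, facet_topics, topic_words, rng):
    """Generate words for one document associated with several entities.

    facet_topics[e][f] is the topic mixture of facet f of entity e.
    topic_words[k] is the word distribution of topic k.
    """
    # One entity-centric representation per entity:
    # a mixture over that entity's facets (assumed Dirichlet prior).
    doc_facets = {
        e: dirichlet(0.5, len(facet_topics[e]), rng) for e in entities
    }
    words = []
    for _ in range(n_words):
        e = rng.choice(entities)                  # assumption: uniform entity choice
        f = categorical(doc_facets[e], rng)       # facet from the entity-centric mixture
        k = categorical(facet_topics[e][f], rng)  # topic from the facet's topic mixture
        w = categorical(topic_words[k], rng)      # word id from the topic
        words.append((e, f, k, w))
    return doc_facets, words


# Hypothetical toy sizes, chosen only to keep the example small.
N_TOPICS, N_FACETS, VOCAB = 4, 2, 10
rng = random.Random(0)
entity_names = ["entity_a", "entity_b"]
topic_words = [dirichlet(0.5, VOCAB, rng) for _ in range(N_TOPICS)]
facet_topics = {
    e: [dirichlet(0.5, N_TOPICS, rng) for _ in range(N_FACETS)]
    for e in entity_names
}
doc_facets, words = generate_document(
    entity_names, 20, facet_topics, topic_words, rng
)
```

Here `doc_facets` holds one facet mixture per entity, which corresponds to the idea of multiple entity-centric representations for a single document; learning these quantities from data, as EFTM does, would require an inference procedure that the abstract does not describe.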
Through experimentation, we confirm that (1) entities are multi-faceted concepts that we can model and learn, (2) a multi-faceted, entity-centric modeling of documents can lead to effective representations, which (3) can have an impact on downstream applications, and (4) considering a small number of facets is sufficient. In particular, we visualize entity facets within a set of documents and demonstrate that different sets of documents indeed reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity than state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity-specific topics, which confirms the intuition that, on average, an entity has a small number of facets reflected in documents.
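For readers unfamiliar with the perplexity metric mentioned above, the following minimal sketch shows the standard definition: the exponential of the negative average per-token log-likelihood on held-out text, where lower is better. The abstract does not state the exact evaluation protocol, so the function name `perplexity` and the assumption that per-token natural-log probabilities are already available from some model are illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    perplexity = exp(-(1/N) * sum_i log p(w_i))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every held-out token
# is as uncertain as a uniform choice among four words.
uniform_four = perplexity([math.log(0.25)] * 8)
```

In this toy case `uniform_four` is approximately 4, which matches the usual reading of perplexity as an effective number of equally likely word choices.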
