HDPauthor: A New Hybrid Author-Topic Model using Latent Dirichlet Allocation and Hierarchical Dirichlet Processes

We present a new approach to capturing the topic interests of an author across all latent topics observed in the documents to which he or she has contributed. Topic models based on Latent Dirichlet Allocation (LDA) have been built for this purpose, but they are brittle with respect to the number of topics allowed for a collection and for each author of documents within the collection. Meanwhile, topic models based on Hierarchical Dirichlet Processes (HDPs) allow an arbitrary number of topics to be discovered and generative distributions of interest to be inferred from text corpora, but this approach does not extend directly to generative models of authors as contributors to documents with variable topical expertise. Our approach combines an existing HDP framework for learning topics from free text with latent authorship learning in a generative model that uses author list information. The model adds another layer to the HDP hierarchy to represent topic groups shared by authors, and each document's topic distribution is represented as a mixture of the topic distributions of its authors. In addition to topics, the model automatically learns author contribution partitions for documents.
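To make the layered structure concrete, the following is a minimal generative sketch of a model of this kind: author-level topic distributions are drawn from a shared HDP, and each document's topic distribution is a mixture of its authors' distributions. It is not the paper's inference procedure; the truncation level, hyperparameter names (gamma, alpha0, eta, K, V), and the uniform prior on author contributions are illustrative assumptions.

```python
# Hedged sketch: truncated stick-breaking HDP with an author layer.
# All parameter names and settings here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

K, V = 30, 500          # truncation level for topics; vocabulary size (assumed)
gamma, alpha0 = 1.0, 1.0
eta = 0.1               # symmetric Dirichlet prior over words (assumed)

def stick_breaking(concentration, base_weights=None, size=K):
    """Draw truncated DP weights; if base_weights is given, draw a DP whose
    discrete base measure has those weights (the HDP construction)."""
    betas = rng.beta(1.0, concentration, size)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights /= weights.sum()
    if base_weights is None:
        return weights
    # Map the new sticks onto atoms drawn from the base distribution.
    atoms = rng.choice(size, size=size, p=base_weights)
    out = np.zeros(size)
    np.add.at(out, atoms, weights)
    return out

# Corpus-level topic weights and per-topic word distributions.
global_weights = stick_breaking(gamma)
topics = rng.dirichlet(np.full(V, eta), size=K)

# Author layer: each author's topic distribution is a DP draw whose base
# measure is the corpus-level distribution (the extra layer in the hierarchy).
n_authors = 5
author_weights = np.stack(
    [stick_breaking(alpha0, global_weights) for _ in range(n_authors)]
)

def generate_document(author_ids, n_words=100):
    """A document's topic distribution is a mixture of its authors' distributions;
    pi plays the role of the latent author contribution partition."""
    pi = rng.dirichlet(np.ones(len(author_ids)))
    doc_topic_dist = pi @ author_weights[author_ids]
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=doc_topic_dist)      # pick a topic for this token
        words.append(rng.choice(V, p=topics[z])) # pick a word from that topic
    return words

doc = generate_document(author_ids=[0, 2], n_words=50)
print(doc[:10])
```

In this sketch, inference would invert the process: given documents and their author lists, recover the topics, the author-level topic distributions, and the per-document contribution weights pi.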
