Mining topics on participations for community discovery

Community discovery on large-scale linked document corpora has been a hot research topic for decades. There are two types of links. The first one, which we call d2d-link, indicates connectiveness among different documents, such as blog references and research paper citations. The other one, which we call u2u-link, represents co-occurrences or simultaneous participations of different users in one document and typically each document from u2u-link corpus has more than one user/author. Examples of u2u-link data covers email archives and research paper co-authorship networks. Community discovery in d2d-link data has achieved much success, while methods for that in u2u-link data either make no use of the textual content of the documents or make oversimplified assumptions about the users and the textual content. In this paper we propose a general approach of community discovery for u2u-link data, i.e., multiple user data, by placing topical variables on multiple authors' participations in documents. Experiments on a research proceeding co-authorship corpus and a New York Times news corpus show the effectiveness of our model.

[1]  M. Newman Coauthorship networks and patterns of scientific collaboration , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[2]  Jure Leskovec,et al.  Empirical comparison of algorithms for network community detection , 2010, WWW '10.

[3]  M E J Newman,et al.  Community structure in social and biological networks , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Ji-Rong Wen,et al.  Scalable community discovery on textual data with relations , 2008, CIKM '08.

[6]  Kevin J. Lang,et al.  Communities from seed sets , 2006, WWW '06.

[7]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[8]  Thomas L. Griffiths,et al.  Discovering Latent Classes in Relational Data , 2004 .

[9]  Andrew McCallum,et al.  Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email , 2007, J. Artif. Intell. Res..

[10]  Hongyuan Zha,et al.  A new Mallows distance based metric for comparing clusterings , 2005, ICML '05.

[11]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[12]  Dennis M. Wilkinson,et al.  A method for finding communities of related genes , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Yan Liu,et al.  Topic-link LDA: joint models of topic and author community , 2009, ICML '09.

[14]  T. Vicsek,et al.  Uncovering the overlapping community structure of complex networks in nature and society , 2005, Nature.

[15]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[16]  Deng Cai,et al.  Topic modeling with network regularization , 2008, WWW.

[17]  Yun Chi,et al.  Facetnet: a framework for analyzing communities and their evolutions in dynamic networks , 2008, WWW.

[18]  David A. Cohn,et al.  The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity , 2000, NIPS.

[19]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[20]  Hongyuan Zha,et al.  Probabilistic models for discovering e-communities , 2006, WWW '06.

[21]  Nando de Freitas,et al.  An Introduction to MCMC for Machine Learning , 2004, Machine Learning.

[22]  Andrew McCallum,et al.  Joint Group and Topic Discovery from Relations and Text , 2006, SNA@ICML.

[23]  J. Lafferty,et al.  Mixed-membership models of scientific publications , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Weixiong Zhang,et al.  An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[25]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[26]  T. Snijders,et al.  Estimation and Prediction for Stochastic Blockstructures , 2001 .