Scalable community discovery on textual data with relations

Every piece of textual data is written to convey its authors' opinions on specific topics. Authors deliberately organize their writing and create links, e.g., references and acknowledgments, to express themselves better. It is therefore of interest to study texts together with their relations in order to understand the underlying topics and communities. Although the literature offers many approaches to data clustering and topic mining, they are not applicable to community discovery on large document corpora for several reasons. First, few of them consider both textual attributes and relations. Second, scalability remains a significant issue for large-scale datasets. Third, most algorithms rely on a set of initial parameters that are hard to determine and tune. Motivated by these observations, this paper proposes a hierarchical community model that distinguishes community cores from affiliated members, and presents a scalable community discovery solution for large-scale document corpora. The proposed method quickly identifies potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, an innovative attribute-based core merge process is introduced so that the algorithm returns consistent communities regardless of the initial parameters. Experimental results suggest that the proposed method scales well with corpus size and feature dimensionality, with more than 15% improvement in topical precision compared with popular clustering techniques.
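
The three stages named in the abstract (core identification from relations, attribute-based core merging, and attachment of affiliated members) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not say how cores are detected or merged, so the reciprocal-link criterion, the cosine similarity on term-weight vectors, the `threshold` and `min_size` parameters, and all function names are assumptions made for illustration only. In particular, the fixed `threshold` is itself a tunable parameter, so the sketch does not reproduce the paper's claim of parameter independence.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def find_cores(links, min_size=2):
    """Seed cores as groups of documents joined by reciprocal links.

    Assumption: a reciprocal link (a -> b and b -> a) marks a core tie.
    """
    link_set = set(links)
    mutual = defaultdict(set)
    for a, b in link_set:
        if a != b and (b, a) in link_set:
            mutual[a].add(b)
            mutual[b].add(a)
    seen, cores = set(), []
    for start in sorted(mutual):
        if start in seen:
            continue
        # Depth-first traversal over reciprocal links only.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(mutual[node] - component)
        seen |= component
        if len(component) >= min_size:
            cores.append(component)
    return cores


def centroid(docs, vectors):
    """Average the sparse term-weight vectors of a set of documents."""
    total = defaultdict(float)
    for doc in docs:
        for term, weight in vectors[doc].items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_cores(cores, vectors, threshold=0.5):
    """Merge cores whose centroids are textually similar (assumed rule)."""
    cores = [set(core) for core in cores]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(cores)), 2):
            sim = cosine(centroid(cores[i], vectors), centroid(cores[j], vectors))
            if sim >= threshold:
                cores[i] |= cores[j]
                del cores[j]
                merged = True
                break  # indices changed, so restart the pairwise scan
    return cores


def affiliate(cores, vectors):
    """Attach every non-core document to its most similar core."""
    members = [set() for _ in cores]
    if not cores:
        return members
    in_core = set().union(*cores)
    centroids = [centroid(core, vectors) for core in cores]
    for doc, vec in vectors.items():
        if doc in in_core:
            continue
        best = max(range(len(cores)), key=lambda k: cosine(vec, centroids[k]))
        members[best].add(doc)
    return members
```

Here `links` is a list of directed `(source, target)` document pairs and `vectors` maps each document id to a dict of term weights. Calling `find_cores`, then `merge_cores`, then `affiliate` yields the merged cores and, for each core, its affiliated members. Seeding from links first keeps the more expensive text comparison restricted to a small number of cores rather than all document pairs.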
