Tracking evolving communities in large linked networks

We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data.

[1]  Duncan J. Watts,et al.  Six Degrees: The Science of a Connected Age , 2003 .

[2]  Mark A. McComb A Practical Guide to Heavy Tails , 2000, Technometrics.

[3]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[4]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[5]  Joao Antonio Pereira,et al.  Linked: The new science of networks , 2002 .

[6]  N. Cozzarelli PNAS Early Edition. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[8]  Murad S. Taqqu,et al.  A Practical Guide to Heavy Tails: Statistical Techniques for Analysing Heavy-Tailed Distributions , 1998 .

[9]  M. M. Kessler Bibliographic coupling between scientific papers , 1963 .

[10]  Duncan J. Watts,et al.  Collective dynamics of ‘small-world’ networks , 1998, Nature.

[11]  Robert L. Goldstone,et al.  The simultaneous evolution of author and paper networks , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Michael I. Jordan,et al.  Advances in Neural Information Processing Systems 30 , 1995 .

[13]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[14]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[15]  Henry G. Small,et al.  Co-citation in the scientific literature: A new measure of the relationship between two documents , 1973, J. Am. Soc. Inf. Sci..