Finding Cohesive Clusters for Analyzing Knowledge Communities

Documents and authors can be clustered into "knowledge communities" based on the overlap in the papers they cite. We introduce a new clustering algorithm, Streemer, which finds cohesive foreground clusters embedded in a diffuse background, and use it to identify knowledge communities as foreground clusters of papers that share common citations. To analyze how these communities evolve over time, we build predictive models with features based on the citation structure, the vocabulary of the papers, and the affiliations and prestige of the authors. Among our findings, scientific knowledge communities tend to grow more rapidly when their publications build on diverse information and when they use a narrow vocabulary.
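The idea of grouping papers by the overlap in what they cite can be illustrated with a small sketch. This is not the Streemer algorithm, whose details are not given here; it only computes a pairwise citation-overlap score, using the Jaccard index as an assumed similarity measure. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def citation_overlap(refs_a, refs_b):
    """Jaccard overlap between two reference lists (assumed measure)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    if not union:
        # Two papers with no references share nothing measurable.
        return 0.0
    return len(a & b) / len(union)


def pairwise_overlap(citations):
    """Map each unordered pair of paper ids to its citation overlap."""
    return {
        (p, q): citation_overlap(citations[p], citations[q])
        for p, q in combinations(sorted(citations), 2)
    }


# Hypothetical toy data: paper id -> ids of the papers it cites.
citations = {
    "paper_a": ["r1", "r2", "r3"],
    "paper_b": ["r2", "r3", "r4"],
    "paper_c": ["r9"],
}
overlaps = pairwise_overlap(citations)
```

In this toy data, `paper_a` and `paper_b` share two of four distinct references, while `paper_c` shares none with either. Intuitively, a foreground cluster would collect papers with high mutual overlap and leave papers like `paper_c` in the background.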
