Finding cohesive clusters for analyzing knowledge communities

Documents and authors can be clustered into “knowledge communities” based on the overlap in the papers they cite. We introduce a new clustering algorithm, Streemer, which finds cohesive foreground clusters embedded in a diffuse background, and use it to identify knowledge communities as foreground clusters of papers that share citations. To analyze how these communities evolve over time, we build predictive models with features based on the citation structure, the vocabulary of the papers, and the affiliations and prestige of the authors. Among our findings, scientific knowledge communities tend to grow more rapidly if their publications build on diverse information and if they use a narrow vocabulary.
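
As a rough illustration of what "overlap in the papers they cite" can mean, the sketch below scores pairs of papers by the Jaccard overlap of their reference lists. This is not Streemer, whose details are not given here; the Jaccard measure, the function names, and the toy data are all assumptions chosen for illustration.

```python
from itertools import combinations


def citation_overlap(refs_a, refs_b):
    """Jaccard overlap between two reference lists (an assumed similarity measure)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    # Two papers with no references at all are treated as having zero overlap.
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(papers):
    """Map each unordered pair of paper ids to its citation overlap."""
    return {
        (p, q): citation_overlap(papers[p], papers[q])
        for p, q in combinations(sorted(papers), 2)
    }


# Hypothetical toy data: paper id -> ids of the works it cites.
papers = {
    "p1": ["r1", "r2", "r3"],
    "p2": ["r2", "r3", "r4"],
    "p3": ["r9"],
}

scores = pairwise_overlap(papers)
```

In this toy example, `p1` and `p2` share two of four distinct references, while `p3` shares none with either. A pairwise score of this kind could feed any clustering method that accepts a similarity matrix; separating a cohesive foreground from a diffuse background is the harder part that the paper's algorithm addresses.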
