Optimal and hierarchical clustering of large-scale hybrid networks for scientific mapping

Previous studies have shown that hybrid clustering methods based on textual and citation information outperform clustering methods that use only one of these components. However, earlier methods focus on the vector space model. In this paper we apply a hybrid clustering method based on the graph model to map the Web of Science database as reflected by the journals it covers. Compared with earlier hybrid clustering strategies, our method is very fast and even achieves better clustering accuracy. In addition, it detects the number of clusters automatically and provides a top-down hierarchical analysis, which suits practical applications. We quantitatively and qualitatively assess the added value of such an integrated analysis, and we investigate whether the clustering outcome provides an appropriate representation of the field structure by comparing it with text-only and citation-only clusterings and with another hybrid method based on a linear combination of distance matrices. Our dataset consists of about 8,000 journals published in the period 2002–2006. The cognitive analysis, including the ranked journals, term annotation, and visualization of the cluster structure, demonstrates the efficiency of our strategy.
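
To make the idea of graph-based hybrid clustering concrete, the following is a minimal sketch and not the method used in the paper. It assumes that textual and citation similarities between journals are available as edge weights, combines them with a hypothetical mixing parameter `alpha`, and then merges communities greedily for as long as weighted modularity increases, so the number of clusters is not fixed in advance. The toy journals `A` to `F`, their similarity values, and the function names are all illustrative assumptions; the greedy bottom-up merge is a simple stand-in for a real modularity optimizer.

```python
from collections import defaultdict


def hybrid_weights(text_sim, cite_sim, alpha=0.5):
    """Combine two edge-weight dicts keyed by (u, v); alpha is an assumed mixing weight."""
    edges = set(text_sim) | set(cite_sim)
    return {
        e: alpha * text_sim.get(e, 0.0) + (1.0 - alpha) * cite_sim.get(e, 0.0)
        for e in edges
    }


def modularity(weights, community):
    """Weighted modularity Q of a partition given as a node -> label dict (no self-loops)."""
    m = sum(weights.values())          # total edge weight
    internal = defaultdict(float)      # edge weight inside each community
    total = defaultdict(float)         # summed node strength of each community
    for (u, v), w in weights.items():
        total[community[u]] += w
        total[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / m - (total[c] / (2.0 * m)) ** 2 for c in total)


def greedy_communities(weights):
    """Naive agglomeration: apply the best modularity-increasing merge until none is left."""
    nodes = {n for edge in weights for n in edge}
    community = {n: n for n in nodes}  # start with every node in its own community
    best_q = modularity(weights, community)
    while True:
        labels = sorted(set(community.values()))
        best_merge = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                trial = {n: (a if c == b else c) for n, c in community.items()}
                q = modularity(weights, trial)
                if q > best_q + 1e-12:
                    best_q, best_merge = q, trial
        if best_merge is None:         # no merge improves Q: stop
            return community, best_q
        community = best_merge


# Toy similarities for six hypothetical journals (illustrative values only).
text_sim = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
            ("D", "E"): 0.9, ("D", "F"): 0.8, ("E", "F"): 0.7, ("C", "D"): 0.2}
cite_sim = {("A", "B"): 0.8, ("A", "C"): 0.6, ("B", "C"): 0.9,
            ("D", "E"): 0.7, ("D", "F"): 0.6, ("E", "F"): 0.9, ("C", "D"): 0.1}

hybrid = hybrid_weights(text_sim, cite_sim, alpha=0.5)
community, q = greedy_communities(hybrid)
print(sorted(community.items()), round(q, 3))
```

On this toy graph the merge loop stops with two groups, {A, B, C} and {D, E, F}, because joining them across the weak C-D edge would lower modularity. The exhaustive pairwise search is only practical for small examples; a network of thousands of journals would need a much faster optimizer.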
