Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database

We propose a new hybrid clustering framework that incorporates text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three algorithms are extended with the proposed weighting scheme and applied to a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods and cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to the other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
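To make the kernel-fusion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes two toy data sources (a term-count matrix and a citation-count matrix for six hypothetical journals), builds one cosine kernel per source, combines them as a weighted sum, and clusters the fused kernel with a plain kernel k-means. The function names, the toy data, and the weights 0.6 and 0.4 are all assumptions made for illustration; in particular, the weights are arbitrary placeholders and are not derived from the information-based scheme described in the abstract.

```python
import numpy as np


def cosine_kernel(X):
    """Linear kernel on L2-normalised rows (cosine similarity)."""
    X = np.asarray(X, dtype=float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T


def fuse_kernels(kernels, weights):
    """Weighted sum of kernel matrices; weights are rescaled to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = np.zeros_like(kernels[0], dtype=float)
    for w_i, K_i in zip(w, kernels):
        fused += w_i * K_i
    return fused


def kernel_kmeans(K, k, init_labels, max_iter=50):
    """Plain kernel k-means on a precomputed kernel matrix K."""
    n = K.shape[0]
    labels = np.asarray(init_labels).copy()
    diag = np.diag(K)
    for _ in range(max_iter):
        dist = np.full((n, k), np.inf)  # empty clusters stay at infinity
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # Squared feature-space distance to the centroid of cluster c:
            # K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
            cross = K[:, members].mean(axis=1)
            within = K[np.ix_(members, members)].mean()
            dist[:, c] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Toy data (assumed): journals 0-2 and 3-5 form two groups in both views.
X_text = [[3, 1, 0, 0], [2, 2, 0, 0], [3, 0, 1, 0],
          [0, 0, 2, 3], [0, 1, 3, 2], [0, 0, 1, 3]]
X_cit = [[5, 1, 0], [4, 0, 1], [5, 2, 0],
         [0, 1, 4], [1, 0, 5], [0, 2, 4]]

kernels = [cosine_kernel(X_text), cosine_kernel(X_cit)]
weights = [0.6, 0.4]  # placeholder weights, not the paper's scheme
K_fused = fuse_kernels(kernels, weights)

# Deterministic, deliberately mixed initial assignment.
labels = kernel_kmeans(K_fused, k=2, init_labels=[0, 1, 0, 1, 0, 1])
print(labels)
```

A weighted sum of positive semidefinite kernels with non-negative weights is itself positive semidefinite, which is why the fused matrix can be passed directly to any kernel-based clustering method. The choice of weights is where a source-weighting scheme such as the one proposed here would plug in.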
