Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods

Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community.

[1]  Martin Rosvall,et al.  An information-theoretic framework for resolving community structure in complex networks , 2007, Proceedings of the National Academy of Sciences.

[2]  Kevin W. Boyack,et al.  Including cited non-source items in a large-scale map of science: What difference does it make? , 2014, J. Informetrics.

[3]  J. Reichardt,et al.  Statistical mechanics of community detection. , 2006, Physical review. E, Statistical, nonlinear, and soft matter physics.

[4]  Fergal Reid,et al.  Detecting highly overlapping community structure by greedy clique expansion , 2010, KDD 2010.

[5]  Nees Jan van Eck,et al.  Evaluation of the citation matching algorithms of CWTS and iFQ in comparison to the Web of science , 2015, J. Assoc. Inf. Sci. Technol..

[6]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Bart De Moor,et al.  Towards mapping library and information science , 2006, Inf. Process. Manag..

[8]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[9]  Dino Pedreschi,et al.  DEMON: a local-first discovery method for overlapping communities , 2012, KDD.

[10]  Kevin W. Boyack,et al.  Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific and Technical Knowledge? , 2015, J. Assoc. Inf. Sci. Technol..

[11]  Tiago P. Peixoto Model selection and hypothesis testing for large-scale network models with overlapping groups , 2014, ArXiv.

[12]  Ludo Waltman,et al.  A smart local moving algorithm for large-scale modularity-based community detection , 2013, The European Physical Journal B.

[13]  Robert L. Wolpert,et al.  Statistical Inference , 2019, Encyclopedia of Social Network Analysis and Mining.

[14]  M. Newman,et al.  Random graphs with arbitrary degree distributions and their applications. , 2000, Physical review. E, Statistical, nonlinear, and soft matter physics.

[15]  Ludo Waltman,et al.  A new methodology for constructing a publication-level classification system of science , 2012, J. Assoc. Inf. Sci. Technol..

[16]  F. Radicchi,et al.  Benchmark graphs for testing community detection algorithms. , 2008, Physical review. E, Statistical, nonlinear, and soft matter physics.

[17]  Andrea Lancichinetti,et al.  Erratum: Community detection algorithms: A comparative analysis [Phys. Rev. E 80, 056117 (2009)] , 2014 .

[18]  Andreas Noack,et al.  Multilevel local search algorithms for modularity clustering , 2011, JEAL.

[19]  Martin Rosvall,et al.  Multilevel Compression of Random Walks on Networks Reveals Hierarchical Organization in Large Integrated Systems , 2010, PloS one.

[20]  Claudio Castellano,et al.  Defining and identifying communities in networks. , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[21]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[22]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[23]  Ludvig Bohlin,et al.  Community detection and visualization of networks with the map equation framework , 2014 .

[24]  Matthieu Latapy,et al.  Computing Communities in Large Networks Using Random Walks , 2004, J. Graph Algorithms Appl..

[25]  Martin Rosvall,et al.  Maps of random walks on complex networks reveal community structure , 2007, Proceedings of the National Academy of Sciences.

[26]  Andrea Lancichinetti,et al.  Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. , 2009, Physical review. E, Statistical, nonlinear, and soft matter physics.

[27]  Henry G. Small,et al.  Update on science mapping: Creating large document spaces , 1997, Scientometrics.

[28]  M. Meilă Comparing clusterings---an information based distance , 2007 .

[29]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[30]  Jure Leskovec,et al.  Defining and evaluating network communities based on ground-truth , 2012, Knowledge and Information Systems.

[31]  M E J Newman,et al.  Fast algorithm for detecting community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[32]  David M Blei,et al.  Efficient discovery of overlapping communities in massive networks , 2013, Proceedings of the National Academy of Sciences.

[33]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Mark E. J. Newman,et al.  An efficient and principled method for detecting communities in networks , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[35]  Martin Rosvall,et al.  Resampling Effects on Significance Analysis of Network Clustering and Ranking , 2012, PloS one.

[36]  Santo Fortunato,et al.  Community detection in networks: Structural communities versus ground truth , 2014, Physical review. E, Statistical, nonlinear, and soft matter physics.

[37]  Christos Faloutsos,et al.  Graph evolution: Densification and shrinking diameters , 2006, TKDD.

[38]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[39]  Ludo Waltman,et al.  CitNetExplorer: A new software tool for analyzing and visualizing citation networks , 2014, J. Informetrics.

[40]  Bart De Moor,et al.  A hybrid mapping of information science , 2008, Scientometrics.

[41]  Sune Lehmann,et al.  Link communities reveal multiscale complexity in networks , 2009, Nature.

[42]  Duncan J. Watts,et al.  Collective dynamics of ‘small-world’ networks , 1998, Nature.

[43]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[44]  Tiago P. Peixoto The entropy of stochastic blockmodel ensembles , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[45]  Vincent A. Traag,et al.  Significant Scales in Community Structure , 2013, Scientific Reports.

[46]  Ed C. M. Noyons,et al.  A unified approach to mapping and clustering of bibliometric networks , 2010, J. Informetrics.

[47]  Andrea Lancichinetti,et al.  Community detection algorithms: a comparative analysis: invited presentation, extended abstract , 2009, VALUETOOLS.

[48]  J. Kumpula,et al.  Sequential algorithm for fast clique percolation. , 2008, Physical review. E, Statistical, nonlinear, and soft matter physics.

[49]  M. Newman,et al.  Robustness of community structure in networks. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[50]  Bo Jarneving,et al.  Bibliographic coupling and its application to research-front and other core documents , 2007, J. Informetrics.

[51]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[52]  Marko Bajec,et al.  Robust network community detection using balanced propagation , 2011, ArXiv.

[53]  Mark E. J. Newman,et al.  Stochastic blockmodels and community structure in networks , 2010, Physical review. E, Statistical, nonlinear, and soft matter physics.

[54]  B. C. Griffith,et al.  The Structure of Scientific Literatures I: Identifying and Graphing Specialties , 1974 .

[55]  Réka Albert,et al.  Near linear time algorithm to detect community structures in large-scale networks. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[56]  V A Traag,et al.  Narrow scope for resolution-limit-free community detection. , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[57]  Santo Fortunato,et al.  Finding Statistically Significant Communities in Networks , 2010, PloS one.

[58]  Steve Gregory,et al.  Finding overlapping communities in networks by label propagation , 2009, ArXiv.

[59]  Jure Leskovec,et al.  Overlapping community detection at scale: a nonnegative matrix factorization approach , 2013, WSDM.

[60]  Kevin W. Boyack,et al.  Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches , 2011, PloS one.

[61]  Marko Bajec,et al.  Group detection in complex networks: An algorithm and comparison of the state of the art , 2013, 1305.5136.

[62]  P. Ronhovde,et al.  Multiresolution community detection for megascale networks by information-based replica correlations. , 2008, Physical review. E, Statistical, nonlinear, and soft matter physics.

[63]  S. Fortunato,et al.  Resolution limit in community detection , 2006, Proceedings of the National Academy of Sciences.

[64]  Carl T. Bergstrom,et al.  Mapping Change in Large Networks , 2008, PloS one.

[65]  Kevin W. Boyack,et al.  Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? , 2010, J. Assoc. Inf. Sci. Technol..

[66]  Marko Bajec,et al.  Unfolding communities in large complex networks: Combining defensive and offensive label propagation for core extraction , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[67]  Jure Leskovec,et al.  Detecting cohesive and 2-mode communities indirected and undirected networks , 2014, WSDM.

[68]  M. Newman,et al.  Finding community structure in very large networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.