A principled methodology for comparing relatedness measures for clustering publications

Many different relatedness measures, based for instance on citation relations or textual similarity, can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures, and we formally show that the methodology has an important consistency property. Our empirical analyses are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as the evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and cocitation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. Conversely, using a citation-based relatedness measure as the evaluation criterion, we find that BM25 yields more accurate clustering solutions than other text-based relatedness measures.
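For readers unfamiliar with the three citation-based relatedness measures named above, the following minimal sketch computes them on a toy citation network using their standard textbook definitions. It is an illustration only, not the implementation used in this work: the data, the function names, and the use of raw unnormalized counts are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data (not from the paper): each publication maps to
# the set of publications it cites.
cites = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"A", "B"},
    "D": {"A", "B", "Z"},
}


def direct_citation(cites):
    """Return the unordered pairs linked by a citation in either direction."""
    pairs = set()
    for citing, refs in cites.items():
        for cited in refs:
            pairs.add(tuple(sorted((citing, cited))))
    return pairs


def bibliographic_coupling(cites):
    """Count the references shared by each pair of citing publications."""
    counts = Counter()
    for p, q in combinations(sorted(cites), 2):
        shared = len(cites[p] & cites[q])
        if shared:
            counts[(p, q)] = shared
    return counts


def cocitation(cites):
    """Count how often each pair of publications appears in the same reference list."""
    counts = Counter()
    for refs in cites.values():
        for p, q in combinations(sorted(refs), 2):
            counts[(p, q)] += 1
    return counts


if __name__ == "__main__":
    print("direct citation:", sorted(direct_citation(cites)))
    print("bibliographic coupling:", dict(bibliographic_coupling(cites)))
    print("cocitation:", dict(cocitation(cites)))
```

In this toy network, A and B are bibliographically coupled because they share references, while X and Y are cocited because they appear together in reference lists. Any of the three resulting pair weights could serve as edge weights for a clustering algorithm; the normalization applied before clustering is left out of this sketch.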
