Return to basics: Clustering of scientific literature using structural information

Scholars frequently employ relatedness measures to estimate the similarity between two items (e.g., documents, authors, or institutes). Such measures are commonly based on overlapping references (i.e., bibliographic coupling) or overlapping citations (i.e., co-citation), and they can be combined with cluster analysis to find boundaries between research fields. Unfortunately, calculating a relatedness measure is challenging, especially for a large number of items, because the computational complexity is greater than linear. We propose an alternative method for identifying the research front that is inspired by relatedness measures but uses direct citations. Our approach simply replicates each node into two distinct nodes: a citing node and a cited node. We then apply typical clustering methods to the modified network. Clusters of citing nodes should emulate those from the bibliographic coupling relatedness network, while clusters of cited nodes should act like those from the co-citation relatedness network. In validation tests, our proposed method demonstrated high levels of similarity with conventional relatedness-based methods. We also found that the clustering results of the proposed method outperformed those of conventional relatedness-based measures in terms of similarity with a natural language processing-based classification.
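
The node replication described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_citation_network` and `build_adjacency`, the `("citing", id)` and `("cited", id)` labels, and the toy citation list are all assumptions made for illustration. Each directed citation becomes an undirected edge between the citing copy of the source and the cited copy of the target, so two citing copies that share neighbors correspond to bibliographically coupled documents, and two cited copies that share neighbors correspond to co-cited documents.

```python
from collections import defaultdict


def split_citation_network(citations):
    """Map each directed citation (citing, cited) to an undirected edge
    between a 'citing' copy of the source and a 'cited' copy of the target.
    The tuple labels are a hypothetical naming scheme for this sketch."""
    return [(("citing", src), ("cited", dst)) for src, dst in citations]


def build_adjacency(edges):
    """Build an undirected adjacency map (node -> set of neighbors)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


# Toy data (made up): A and B both cite C and D; E cites A.
citations = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("E", "A")]
edges = split_citation_network(citations)
adjacency = build_adjacency(edges)

# Citing copies of A and B share the cited copies of C and D,
# which mirrors the bibliographic coupling of A and B.
shared_references = adjacency[("citing", "A")] & adjacency[("citing", "B")]

# Cited copies of C and D share the citing copies of A and B,
# which mirrors the co-citation of C and D.
shared_citers = adjacency[("cited", "C")] & adjacency[("cited", "D")]

print(sorted(shared_references))
print(sorted(shared_citers))
```

The modified network has one edge per citation, so building it scales linearly with the number of citations, and any standard community detection method can then be run on it; this sketch stops short of the clustering step.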
