Quantifying and suppressing ranking bias in a large citation network

It is widely recognized that citation counts for papers from different fields cannot be directly compared, because different scientific fields follow different citation practices. Citation counts are also strongly biased by paper age, since older papers have had more time to attract citations. Various procedures aim to suppress these biases and give rise to normalized indicators such as the relative citation count. Using a large citation dataset from Microsoft Academic Graph and a new statistical framework based on the Mahalanobis distance, we show that rankings by well-known indicators, including the relative citation count and Google's PageRank score, are significantly biased by paper field and age. Our framework for assessing ranking bias lets us quantify exactly the contribution of each individual field to the overall bias of a given ranking. We propose a general normalization procedure, motivated by the z-score, that produces much less biased rankings when applied to citation count and PageRank score.
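To make the idea of a z-score-motivated normalization concrete, here is a minimal sketch rather than the exact procedure proposed in the paper. It assumes that each paper's score is compared against papers sharing the same field and publication year; that grouping rule, the function name `zscore_by_group`, the record keys, and the toy numbers are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_group(papers):
    """Rescale citation counts by the mean and standard deviation
    of each paper's (field, year) group. Hypothetical helper."""
    # Collect the raw scores of every (field, year) group.
    groups = defaultdict(list)
    for p in papers:
        groups[(p["field"], p["year"])].append(p["citations"])

    # Population mean and standard deviation per group.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    rescaled = {}
    for p in papers:
        mu, sigma = stats[(p["field"], p["year"])]
        # A group with zero spread carries no ranking information,
        # so its papers are assigned 0.0 (an assumption of this sketch).
        rescaled[p["id"]] = (p["citations"] - mu) / sigma if sigma > 0 else 0.0
    return rescaled


# Toy data (invented for illustration): a high-citation and a low-citation field.
papers = [
    {"id": "a", "field": "biology", "year": 2000, "citations": 100},
    {"id": "b", "field": "biology", "year": 2000, "citations": 60},
    {"id": "c", "field": "mathematics", "year": 2000, "citations": 10},
    {"id": "d", "field": "mathematics", "year": 2000, "citations": 6},
]
scores = zscore_by_group(papers)
```

In this toy example, papers "a" and "c" receive the same rescaled score even though their raw citation counts differ tenfold, because each sits one standard deviation above the mean of its own group. The same rescaling could in principle be applied to any per-paper score, such as a PageRank value, by replacing the `citations` field.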
