Are the discretised lognormal and hooked power law distributions plausible for citation data?

There is no agreement over which statistical distribution is most appropriate for modelling citation count data. This matters because, if one distribution were accepted, the relative merits of different citation-based indicators, such as percentiles, arithmetic means and geometric means, could be assessed more fully. In response, this article investigates the plausibility of the discretised lognormal and hooked power law distributions for modelling the full range of citation counts, after adding an offset of 1 to each count so that uncited articles can be included. Both distributions were fitted to the citation counts from 23 Scopus subcategories, but both failed a Kolmogorov–Smirnov goodness of fit test in over three quarters of cases. The discretised lognormal distribution also seems to have the wrong shape for citation distributions, with too few zeros and not enough medium values for all subjects. The poor fits could be caused by the impurity of the subject subcategories or by the presence of interdisciplinary research. Although, in theory, subject subcategory purity could be tested indirectly through a goodness of fit test with large enough sample sizes, this is probably not possible in practice. Hence it seems difficult to obtain conclusive evidence about the theoretically most appropriate statistical distribution.
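To make the comparison concrete, the following is a minimal Python sketch of the two candidate distributions and of a Kolmogorov–Smirnov statistic for discrete data. It rests on several assumptions that are not taken from the article: the hooked power law is written as p(x) proportional to 1/(B + x)^alpha, the lognormal is discretised by evaluating its density at each positive integer and renormalising, the support is truncated at a hypothetical upper bound `n_max`, the citation counts are made up, and the parameter values are illustrative rather than fitted. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def hooked_power_law_pmf(alpha, B, n_max):
    """Return p(1..n_max) with p(x) proportional to 1 / (B + x)**alpha."""
    weights = [(B + x) ** -alpha for x in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def discretised_lognormal_pmf(mu, sigma, n_max):
    """Return p(1..n_max) with p(x) proportional to the lognormal density at x."""
    weights = [
        math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2)) / x
        for x in range(1, n_max + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def ks_statistic(shifted_counts, pmf):
    """Largest absolute gap between the empirical and model CDFs on 1..len(pmf)."""
    n = len(shifted_counts)
    freq = Counter(shifted_counts)
    emp_cdf = model_cdf = gap = 0.0
    for x, p in enumerate(pmf, start=1):
        emp_cdf += freq.get(x, 0) / n
        model_cdf += p
        gap = max(gap, abs(emp_cdf - model_cdf))
    return gap


# Made-up citation counts, shifted by the offset of 1 so that zeros are allowed.
citations = [0, 0, 0, 1, 1, 2, 3, 5, 8, 21]
shifted = [c + 1 for c in citations]

# Assumed truncation point; it must be at least max(shifted).
n_max = 10_000

# Illustrative (not fitted) parameter values.
ks_hooked = ks_statistic(shifted, hooked_power_law_pmf(2.5, 5.0, n_max))
ks_lognormal = ks_statistic(shifted, discretised_lognormal_pmf(1.0, 1.2, n_max))
print(ks_hooked, ks_lognormal)
```

In a real analysis the parameters would be estimated first, for example by maximum likelihood. The significance of the statistic would then need a method suited to discrete data with estimated parameters, such as a parametric bootstrap, since the standard Kolmogorov–Smirnov tables assume a fully specified continuous distribution.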
