Quantifying the Consistency of Scientific Databases

Science is a social process with far-reaching impact on modern society. In recent years it has become possible, for the first time, to study science itself scientifically. This is enabled by the massive and growing amount of data on scientific publications. The data are held in several databases, such as Web of Science or PubMed, maintained by various public and private entities. Unfortunately, these databases are not always consistent with one another, which considerably hinders such study. Relying on the powerful framework of complex networks, we conduct a systematic analysis of the consistency among six major scientific databases. We find that identifying a single "best" database is far from easy. Nevertheless, our results indicate appreciable differences in the mutual consistency of the databases, which we interpret as recipes for future bibliometric studies.
