The Number of Scholarly Documents on the Public Web

We estimate the number of scholarly documents available on the web using capture/recapture methods, based on the coverage of two major academic search engines: Google Scholar and Microsoft Academic Search. Our estimates show that at least 114 million English-language scholarly documents are accessible on the web, of which Google Scholar covers nearly 100 million. Of the 114 million, we estimate that at least 27 million (24%) are freely available, in that they do not require a subscription or payment of any kind. At a finer scale, we also estimate the number of scholarly documents on the web for fifteen fields, as defined by Microsoft Academic Search: Agricultural Science, Arts and Humanities, Biology, Chemistry, Computer Science, Economics and Business, Engineering, Environmental Sciences, Geosciences, Material Science, Mathematics, Medicine, Physics, Social Sciences, and Multidisciplinary. Among these fields, the percentage of documents that are freely available varies significantly, from 12% to 50%.
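
To illustrate the capture/recapture idea, the sketch below applies the classic two-sample Lincoln-Petersen estimator, which treats the documents found by each search engine as one "capture" and the documents found by both as the "recaptures". This is a minimal sketch and not necessarily the exact model used for the estimates above; the function names and all counts in the example are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def lincoln_petersen(n_first: int, n_second: int, n_both: int) -> float:
    """Two-sample Lincoln-Petersen population estimate.

    n_first  -- items found by the first source (hypothetical count)
    n_second -- items found by the second source (hypothetical count)
    n_both   -- items found by both sources (the overlap)

    Assumes a closed population and independent sources, which is a
    simplification for real search engines.
    """
    if n_both <= 0:
        raise ValueError("overlap must be positive to form an estimate")
    if n_both > min(n_first, n_second):
        raise ValueError("overlap cannot exceed either sample size")
    return n_first * n_second / n_both


def coverage(n_source: int, n_total: float) -> float:
    """Fraction of the estimated population that one source covers."""
    return n_source / n_total


# Hypothetical counts, not data from the study.
engine_a = 800   # documents found by engine A
engine_b = 500   # documents found by engine B
in_both = 400    # documents found by both engines

estimated_total = lincoln_petersen(engine_a, engine_b, in_both)
coverage_a = coverage(engine_a, estimated_total)

print(f"estimated population: {estimated_total:.0f}")
print(f"coverage of engine A: {coverage_a:.0%}")
```

The intuition is that if engine B finds 400 of engine A's 800 documents, B is assumed to see about half of everything, so its 500 documents imply a population of roughly 1,000. The same kind of ratio underlies the freely available share quoted above: 27 million out of 114 million is about 24%.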
