Trends in the production of scientific data analysis resources

Background: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically accessed via an Internet URL, and such URLs have been shown to decay in a time-dependent fashion. What is less clear is whether SDAR-producing group size or prior experience in SDAR production correlates with SDAR persistence, or whether certain institutions or regions account for a disproportionate number of peer-reviewed resources.

Methods: We first quantified the current availability of over 26,000 unique URLs published in MEDLINE abstracts/titles over the past 20 years, then extracted authorship, institutional and ZIP code data. We estimated which URLs were SDARs by using keyword proximity analysis.

Results: We identified 23,820 non-archival URLs produced between 1996 and 2013, of which 11,977 were classified as SDARs. Production of SDARs, as measured with the Gini coefficient, is more widely distributed among institutions (0.62) and ZIP codes (0.65) than scientific research in general, which tends to be disproportionately clustered within elite institutions (0.91) and ZIP codes (0.96). The top 1% of institutions produced an estimated 68% of published research but accounted for only 16% of SDARs. Some labs produced many SDARs (maximum detected = 64), but 74% of SDAR-producing authors have published only one SDAR. Interestingly, decayed SDARs have significantly fewer authors on average (4.33 ± 3.06) than available SDARs (4.88 ± 3.59) (p < 8.32 × 10^-4). Approximately 3.4% of URLs, as published, contain errors in their entry/format, including DOIs and links to clinical trials registry numbers.

Conclusion: SDAR production is less dependent upon institutional location and resources than scientific research in general, and SDAR online persistence does not seem to be a function of infrastructure or expertise. Yet SDAR team size correlates positively with SDAR accessibility, suggesting that a sociological factor may be involved. While a detectable URL entry error rate of 3.4% is relatively low, it raises the question of whether this is a general error rate that extends to other published entities.
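To make the inequality measures in the Results concrete, the following is a minimal sketch of how a Gini coefficient and a "top 1% share" could be computed from per-institution output counts. It is an illustration under stated assumptions, not the authors' pipeline: the function names `gini` and `top_share` and the sample counts are hypothetical, and the abstract does not state which estimator of the Gini coefficient was used, so the standard sorted-rank formula is assumed here.

```python
import math


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even).

    Assumption: the standard sorted-rank estimator; the abstract does not
    say which variant the authors used.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Weight each value by its 1-based rank in ascending order.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def top_share(counts, fraction):
    """Share of total output produced by the top `fraction` of producers."""
    values = sorted(counts, reverse=True)
    if not values or sum(values) == 0:
        raise ValueError("need at least one non-zero count")
    # Always include at least one producer in the "top" group.
    k = max(1, math.ceil(fraction * len(values)))
    return sum(values[:k]) / sum(values)


# Hypothetical per-institution counts, for illustration only.
even_output = [1, 1, 1, 1]
concentrated_output = [0, 0, 0, 10]

print(gini(even_output))          # 0.0: output spread evenly
print(gini(concentrated_output))  # 0.75: one institution produces everything
print(top_share(concentrated_output, 0.25))  # 1.0
```

With this estimator, a value near 0.9 (as reported for research in general) indicates output concentrated in a few institutions, while a value near 0.6 (as reported for SDARs) indicates a flatter distribution.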
