A Survey of Bioinformatics Database and Software Usage through Mining the Literature

Computer-based resources are central to much, if not most, biological and medical research. However, while the biomedical literature describes an ever-expanding choice of bioinformatics resources, little work to date has evaluated the full range of database and software resources that are available or how heavily each is used. Here we use text mining to process the PubMed Central full-text corpus, identifying mentions of databases or software within the scientific literature. We provide an audit of the resources contained within the biomedical literature, and a comparison of their relative usage, both over time and between the sub-disciplines of bioinformatics, biology and medicine. We find that trends in resource usage differ between these domains. The bioinformatics literature emphasises novel resource development, while database and software usage within biology and medicine is more stable and conservative. Many resources are mentioned only in the bioinformatics literature, with a relatively small number making it out into general biology, and fewer still into the medical literature. In addition, many resources are seeing a steady decline in their usage (e.g., BLAST, SWISS-PROT), though some are instead seeing rapid growth (e.g., GO, R). We find a striking imbalance in resource usage, with the top 5% of resource names (133 names) accounting for 47% of total usage, and over 70% of the extracted resources being mentioned only once each. While these results highlight the dynamic and creative nature of bioinformatics research, they raise questions about software reuse, choice and the sharing of bioinformatics practice. Is it acceptable that so many resources are apparently never reused? Finally, our work is a step towards automated extraction of scientific method from text. We make the dataset generated by our study available under the CC0 license here: http://dx.doi.org/10.6084/m9.figshare.1281371.
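The two imbalance figures quoted above (the share of usage held by the top 5% of resource names, and the fraction of resources mentioned only once) are simple to compute from a list of extracted mentions. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `usage_concentration`, the flat list-of-names input format, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def usage_concentration(mentions, top_fraction=0.05):
    """Summarise how unevenly mentions are spread across resource names.

    `mentions` is assumed to be an iterable with one resource name per
    extracted mention (a hypothetical input format).
    Returns (top_share, singleton_fraction):
      top_share          - share of all mentions held by the most-mentioned
                           `top_fraction` of names
      singleton_fraction - fraction of names mentioned exactly once
    """
    counts = Counter(mentions)
    if not counts:
        raise ValueError("no mentions supplied")

    # Mention counts per name, most-mentioned first.
    ranked = sorted(counts.values(), reverse=True)

    # Number of names in the "top" group; always at least one.
    top_n = max(1, math.ceil(len(ranked) * top_fraction))

    top_share = sum(ranked[:top_n]) / sum(ranked)
    singleton_fraction = sum(1 for c in ranked if c == 1) / len(ranked)
    return top_share, singleton_fraction


# Toy data invented for illustration; these are not figures from the study.
toy_mentions = ["BLAST"] * 6 + ["R"] * 2 + ["ToolA", "ToolB"]
top_share, singleton_fraction = usage_concentration(toy_mentions, top_fraction=0.25)
print(f"top share: {top_share:.2f}, singletons: {singleton_fraction:.2f}")
```

Applied to the published dataset with `top_fraction=0.05`, a calculation of this kind would be expected to give values near the 47% and 70% reported above, though the exact result depends on how name variants are normalised before counting.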
