Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources

Data from open access biomolecular data resources, such as the European Nucleotide Archive and the Protein Data Bank, are extensively reused within life science research for comparative studies, method development and the derivation of new scientific insights. Indicators that estimate the extent and utility of such secondary use of research data need to reflect this complex and highly variable data usage. By linking open access scientific literature, via Europe PubMed Central, to the metadata in biological data resources, we separate data citations associated with a deposition statement from citations that capture the subsequent, long-term reuse of data in academia and industry. We extend this analysis to begin investigating citations of biomolecular resources in patent documents. We find citations in more than 8,000 patents from 2014, demonstrating substantial use and an important role for data resources in defining biological concepts in patents granted to both academic and industrial innovators. Taken together, our results indicate that citation patterns in biomedical literature and patents vary, not only due to citation practice but also according to the data resource cited. The results caution against relying on simple metrics such as citation counts and show that indicators of data use must take into account not only citations within the biomedical literature but also reuse of data in industry and other parts of society, by including patents and other scientific and technical documents such as guidelines, reports and grant applications.
