Adventures in data citation: sorghum genome data exemplifies the new gold standard

Scientific progress is driven by the availability of information, which makes it essential that data be broadly, easily and rapidly accessible to researchers in every field. In addition to being good scientific practice, providing supporting data in a convenient form increases experimental transparency and improves research efficiency by reducing unnecessary duplication of experiments. There are, however, serious constraints that limit extensive data dissemination. One such constraint is that, despite providing a major foundation of data to the advantage of the entire community, data producers rarely receive the credit they deserve for the substantial time and effort they spend creating these resources. A formal system that recognizes data producers would therefore give them an incentive to share more of their data.

Data citation, in which the data themselves are cited and referenced in journal articles as persistently identifiable bibliographic entities, is a potential way to properly acknowledge data output. The recent publication of several sorghum genomes in Genome Biology is a notable first example of good data citation practice in the field of genomics, and it demonstrates the practicalities and formatting required for doing so. It also illustrates how effective use of persistent identifiers can augment the submission of data to the current standard scientific repositories.
