Experiences in integrated data and research object publishing using GigaDB

In the era of computation and data-driven research, traditional methods of disseminating research are no longer fit for purpose. New approaches for disseminating data, methods and results are required to maximize knowledge discovery. The “long tail” of small, unstructured datasets is well catered for by a number of general-purpose repositories, but there has been less support for “big data”. Outlined here are our experiences in attempting to tackle the gaps in publishing large-scale, computationally intensive research. GigaScience is an open-access, open-data journal aiming to revolutionize large-scale biological data dissemination, organization and re-use. Through use of the data handling infrastructure of the genomics centre BGI, GigaScience links standard manuscript publication with an integrated database (GigaDB) that hosts all associated data, and provides additional data analysis tools and computing resources. Furthermore, the supporting workflows and methods are also integrated to make published articles more transparent and open. GigaDB has released many new and previously unpublished datasets and data types, including urgently needed data to tackle infectious disease outbreaks, cancer and the growing food crisis. Other “executable” research objects, such as workflows, virtual machines and software from several GigaScience articles, have been archived and shared in reproducible, transparent and usable formats. With data citation producing evidence of, and credit for, its use in the wider research community, GigaScience demonstrates a move towards more executable publications. In such publications, data analyses can be reproduced and built upon by users without coding backgrounds or heavy computational infrastructure, in a more democratized manner.
