Best practice data life cycle approaches for the life sciences

Throughout history, the life sciences have been revolutionised by technological advances. In our era this is manifested by advances in instrumentation for data generation, and researchers consequently now routinely handle large amounts of heterogeneous data in digital formats. The simultaneous transitions towards biology as a data science and towards a ‘life cycle’ view of research data pose new challenges. Researchers face a bewildering landscape of data management requirements, recommendations and regulations, often without access to data management training or a clear understanding of the practical approaches that can assist data management in their particular research domain. Here we provide an overview of best practice data life cycle approaches for researchers in the life sciences and bioinformatics, with a particular focus on ‘omics’ datasets and computer-based data processing and analysis. We discuss the different stages of the data life cycle and suggest practical tools and resources for improving data management practices.
