AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture

Abstract The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.

Stephen P. Ficklin | Nic Herndon | Leonore Reiser | Tanya Z. Berardini | Chris Mungall | Clement Jonquet | Elizabeth Arnaud | Daureen Nesdill | Jill L. Wegrzyn | Christine G. Elsik | Lisa C. Harper | Taner Z. Sen | Bradford Condon | Monica Poelchau | Dorrie Main | Sook Jung | Marie-Angélique Laporte | Pankaj Jaiswal | Naama Menda | Ramona L. Walls | Carson M. Andorf | Laurel Cooper | Rex T. Nelson | Doreen Ware | Pierre Larmande | Ethalinda K. S. Cannon | James M. Reecy | Jing Yu | Jacqueline D. Campbell | Clayton L. Birkett | Steve Cannon | James Carson | Nathan A. Dunn | Andrew Farmer | David Grant | Emily S. Grau | Zhi-Liang Hu | Jodi Humann | Gerard R. Lazo | Fiona McCarthy | Monica C. Munoz-Torres | Sushma Naithani | Carissa A. Park | Lacey-Anne Sanderson | Margaret Staton | Sabarinath Subramaniam | Marcela K. Tello-Ruiz | Victor Unda | Deepak R. Unni | Liya Wang | Jason Williams | Margaret Woodhouse
