High density genotype storage for plant breeding in the Chado schema of Breedbase

Modern breeding programs routinely use genome-wide information to select individuals for advancement. The large volumes of genotypic information required present a challenge for data storage and query efficiency. Major use cases require genotyping data to be linked with trait phenotyping data. In contrast to phenotyping data, which are often stored in relational database schemas, next-generation genotyping data are traditionally stored in non-relational storage systems because of their very large volume. This study presents a novel data model implemented in Breedbase (https://breedbase.org/) for uniting relational phenotyping data and non-relational genotyping data within the open-source PostgreSQL database engine. Breedbase is an open-source web database designed to manage all of a breeder’s informatics needs: management of field experiments, phenotypic and genotypic data collection and storage, and statistical analyses. The genotyping data are stored in a PostgreSQL data type known as binary JavaScript Object Notation (JSONb), where the JSON structures closely follow the Variant Call Format (VCF) data model. The Breedbase genotyping data model can handle different ploidy levels, structural variants, and any genotype encoded in VCF. JSONb is both compressed and indexed, resulting in a space- and time-efficient system. Furthermore, file caching maximizes data retrieval performance. Integrating all breeding data within the Chado database schema retains the referential integrity that may be lost when genotyping and phenotyping data are stored in separate systems. Benchmarking demonstrates that the system is fast enough to compute a genomic relationship matrix (GRM) and run a genome-wide association study (GWAS) for datasets of 1,325 diploid Zea mays, 314 triploid Musa acuminata, and 924 diploid Manihot esculenta samples genotyped with 955,690, 142,119, and 287,952 genotyping-by-sequencing (GBS) markers, respectively.
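To make the idea of JSON structures that follow the VCF data model more concrete, the following Python sketch converts one VCF data line into one JSON document per sample, keyed by marker name. It is an illustration under stated assumptions, not the Breedbase implementation: the function name `vcf_record_to_sample_docs`, the per-sample document layout, and the `alt_dosage` key are hypothetical, and the abstract does not specify the exact JSON layout Breedbase uses. Counting non-reference alleles rather than assuming two alleles per call is what lets the same structure hold diploid and triploid genotypes.

```python
import json


def vcf_record_to_sample_docs(sample_names, vcf_line):
    """Turn one VCF data line into {sample: {marker: call}} dictionaries.

    The layout (one JSON document per sample, keyed by marker name) and the
    'alt_dosage' key are assumptions made for this illustration only.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, marker_id = fields[0], fields[1], fields[2]
    format_keys = fields[8].split(":")
    # Fall back to chromosome_position when the VCF ID column is empty (".").
    marker_name = marker_id if marker_id != "." else f"{chrom}_{pos}"

    docs = {}
    for sample, raw_call in zip(sample_names, fields[9:]):
        # Pair FORMAT keys (GT, DP, ...) with this sample's values.
        call = dict(zip(format_keys, raw_call.split(":")))
        alleles = call.get("GT", ".").replace("|", "/").split("/")
        if "." in alleles:
            call["alt_dosage"] = None  # missing genotype
        else:
            # Count non-reference alleles; works for any ploidy level.
            call["alt_dosage"] = sum(1 for a in alleles if a != "0")
        docs[sample] = {marker_name: call}
    return docs


# Hypothetical record: a diploid call, a triploid call, and a missing call.
samples = ["s1", "s2", "s3"]
line = "1\t100\tm1\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:12\t1/1/0:30\t./.:0"
for sample, doc in vcf_record_to_sample_docs(samples, line).items():
    print(sample, json.dumps(doc))
```

Each serialized document could then be written to a JSONb column in PostgreSQL, where it can be indexed and queried alongside relational phenotype records. The table and column that would hold it are not shown here because the abstract does not name them.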
