TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline

Genotyping by sequencing (GBS) is a next-generation sequencing-based method that takes advantage of reduced representation to enable high-throughput genotyping of large numbers of individuals at a large number of SNP markers. The relatively straightforward, robust, and cost-effective GBS protocol is currently being applied in numerous species by a large number of researchers. Herein we describe a bioinformatics pipeline, TASSEL-GBS, designed for the efficient processing of raw GBS sequence data into SNP genotypes. The TASSEL-GBS pipeline fulfills the following key design criteria: (1) ability to run on the modest computing resources typically available to small breeding or ecological research programs, including desktop or laptop machines with only 8–16 GB of RAM; (2) scalability from small to extremely large studies, where hundreds of thousands or even millions of SNPs can be scored in up to 100,000 individuals (e.g., for large breeding programs or genetic surveys); and (3) applicability in an accelerated breeding context, requiring rapid turnover from tissue collection to genotypes. Although a reference genome is required, the pipeline can also be run with an unfinished “pseudo-reference” consisting of numerous contigs. We describe the TASSEL-GBS pipeline in detail and benchmark it based upon a large-scale, species-wide analysis in maize (Zea mays), where the average error rate was reduced to 0.0042 through application of population genetic-based SNP filters. Overall, the GBS assay and the TASSEL-GBS pipeline provide robust tools for studying genomic diversity.
