Variant calling using NGS and sequence capture data for population and evolutionary genomic inferences in Norway Spruce (Picea abies)

Advances in next-generation sequencing and the development of new statistical and computational methods have opened up possibilities for large-scale, high-quality genotyping in most organisms. Conifer genomes are large and contain a high fraction of repetitive elements, and this complex genome structure has implications for approaches that use next-generation sequencing for genotyping. In this chapter we provide a detailed description of a workflow for variant calling using next-generation sequencing data in Norway spruce (Picea abies). The workflow starts with raw sequencing reads and proceeds through read mapping to variant calling and variant filtering. We illustrate the pipeline using data from both whole-genome resequencing and reduced-representation sequencing. We highlight problems and pitfalls of using next-generation sequencing data for genotyping that stem from the complex genome structure of conifers, and describe how those issues can be mitigated or eliminated.
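The stages named in the abstract (raw reads, read mapping, variant calling, variant filtering) can be pictured with a minimal sketch. It is an illustration only and not the chapter's actual pipeline: the choice of BWA, SAMtools and BCFtools, the function name `build_pipeline`, the file names, and every threshold are assumptions made here. The code only assembles shell command strings and executes nothing.

```python
import shlex


def build_pipeline(ref, r1, r2, sample, threads=4,
                   min_qual=20, min_dp=5, max_dp=100):
    """Return shell commands for map -> call -> filter (nothing is run).

    All tool choices and thresholds are illustrative assumptions,
    not values taken from the chapter.
    """
    # Quote every path so the commands stay safe to paste into a shell.
    ref, r1, r2 = (shlex.quote(p) for p in (ref, r1, r2))
    bam = shlex.quote(f"{sample}.sorted.bam")
    raw = shlex.quote(f"{sample}.raw.vcf.gz")
    flt = shlex.quote(f"{sample}.filtered.vcf.gz")

    # Exclude low-quality sites and sites with unusually low or high depth.
    # Excess depth is one common symptom of reads from repeated sequence
    # piling up on a single reference location.
    expr = shlex.quote(
        f"QUAL<{min_qual} || INFO/DP<{min_dp} || INFO/DP>{max_dp}"
    )

    return [
        # 1. Index the reference once.
        f"bwa index {ref}",
        # 2. Map paired-end reads and coordinate-sort the alignments.
        f"bwa mem -t {threads} {ref} {r1} {r2} | samtools sort -o {bam} -",
        f"samtools index {bam}",
        # 3. Call variants from the sorted alignments.
        f"bcftools mpileup -Ou -f {ref} {bam} | bcftools call -mv -Oz -o {raw}",
        # 4. Filter the raw calls.
        f"bcftools filter -e {expr} -Oz -o {flt} {raw}",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("ref.fa", "s1_1.fq.gz", "s1_2.fq.gz", "s1"):
        print(cmd)
```

The depth ceiling is the part most relevant to repeat-rich conifer genomes, and a sensible value depends on the sequencing depth of the data set at hand, so the default used here should be treated as a placeholder.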
