A high-throughput SNP discovery strategy for RNA-seq data

Background: Single nucleotide polymorphisms (SNPs) are important molecular markers in genetics and breeding studies. The rapid advance of next generation sequencing (NGS) provides a high-throughput means of SNP discovery. However, SNP development is limited by the availability of reliable SNP discovery methods. In particular, the optimal assembler and SNP caller for accurate SNP prediction from NGS data are not known.

Results: We performed SNP prediction on RNA-seq data from peach and mandarin peel tissue, comparing two paired-end read lengths (125 bp and 150 bp), five assemblers (Trinity, IDBA, Oases, SOAPdenovo, Trans-ABySS) and two SNP callers (GATK and GBS). The predicted SNPs were compared with authentic SNPs identified by PCR amplification followed by gene cloning and sequencing. In total, 40 authentic SNPs were present in five anthocyanin biosynthesis related genes in peach, and 240 in nine carotenogenic genes in mandarin. Putative SNPs predicted from the same RNA-seq data with different strategies diverged considerably. The rate of false positive SNPs was significantly lower with a paired-end read length of 150 bp than with 125 bp. Trinity was superior to the other four assemblers, and GATK was substantially superior to GBS because it missed fewer authentic SNPs. The combination of the Trinity assembler, the GATK SNP caller, and 150 bp paired-end reads performed best, with 100% accuracy in both the peach and the mandarin cases. This strategy was then applied to characterize SNPs in the peach and mandarin transcriptomes.

Conclusions: By comparing authentic SNPs obtained by PCR cloning with putative SNPs predicted from different combinations of five assemblers, two SNP callers, and two paired-end read lengths, we provide a reliable and efficient strategy for SNP discovery from RNA-seq data: Trinity-GATK with 150 bp paired-end reads. This strategy discovered SNPs with 100% accuracy in the peach and mandarin cases and might be applicable to a wide range of plants and other organisms.
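
The comparison of predicted against authentic SNPs described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `evaluate_snp_calls`, the `(gene, position, alt_base)` tuple representation, and the exact definitions of the two rates (false positives as a fraction of predicted SNPs, missing SNPs as a fraction of authentic SNPs) are assumptions made here for illustration, and the example data are invented.

```python
def evaluate_snp_calls(predicted, authentic):
    """Compare predicted SNPs with a validated (authentic) set.

    Both arguments are iterables of hypothetical (gene, position, alt_base)
    tuples. The rate definitions below are assumptions for this sketch.
    """
    predicted, authentic = set(predicted), set(authentic)
    true_pos = predicted & authentic    # predicted and confirmed by cloning
    false_pos = predicted - authentic   # predicted but not confirmed
    missed = authentic - predicted      # confirmed but not predicted
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "missed": len(missed),
        "false_positive_rate": len(false_pos) / len(predicted) if predicted else 0.0,
        "missing_rate": len(missed) / len(authentic) if authentic else 0.0,
    }


# Invented toy data: four authentic SNPs, four predictions, three in common.
authentic = {("geneA", 101, "T"), ("geneA", 250, "G"),
             ("geneB", 77, "C"), ("geneB", 300, "A")}
predicted = {("geneA", 101, "T"), ("geneB", 77, "C"),
             ("geneB", 300, "A"), ("geneB", 412, "G")}

result = evaluate_snp_calls(predicted, authentic)
print(result)
```

Under these assumed definitions, a strategy with no false positives and no missed SNPs would score zero on both rates, which is one way to read the 100% accuracy reported for Trinity-GATK with 150 bp reads.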

[20]  N. Rodríguez‐Ezpeleta,et al.  Bioinformatics for High Throughput Sequencing , 2012, Springer New York.

[21]  T. Strom,et al.  Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken , 2015, BMC Genomics.

[22]  Zhanjiang Liu,et al.  Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote , 2012, BMC Genomics.

[23]  Jian Xu,et al.  SNP calling using genotype model selection on high-throughput sequencing data , 2012, Bioinform..

[24]  J. Dopazo,et al.  Genomics of the origin and evolution of Citrus , 2018, Nature.

[25]  I. Rigoutsos,et al.  The complex transcriptional landscape of the anucleate human platelet , 2013, BMC Genomics.

[26]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[27]  C. Mason,et al.  The impact of read length on quantification of differentially expressed genes and splice junction detection , 2015, Genome Biology.

[28]  A. Brookes The essence of SNPs. , 1999, Gene.

[29]  Jun Wang,et al.  SNP Calling, Genotype Calling, and Sample Allele Frequency Estimation from New-Generation Sequencing Data , 2012, PloS one.

[30]  J. Montoya-Burgos,et al.  Optimization of de novo transcriptome assembly from next-generation sequencing data. , 2010, Genome research.

[31]  Huanming Yang,et al.  SNP detection for massively parallel whole-genome resequencing. , 2009, Genome research.

[32]  Francisco Manzano-Agugliaro,et al.  Trends in plant research using molecular markers , 2018, Planta.

[33]  Kun-song Chen,et al.  Differential Sensitivity of Fruit Pigmentation to Ultraviolet Light between Two Peach Cultivars , 2017, Front. Plant Sci..

[34]  M. Blaxter,et al.  Genome-wide genetic marker discovery and genotyping using next-generation sequencing , 2011, Nature Reviews Genetics.

[35]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[36]  Timothy B. Stockwell,et al.  Evaluation of next generation sequencing platforms for population targeted sequencing studies , 2009, Genome Biology.

[37]  M. Ashburner A Laboratory manual , 1989 .

[38]  Gabor T. Marth,et al.  Haplotype-based variant detection from short-read sequencing , 2012, 1207.3907.

[39]  L. Seeb,et al.  Single‐nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in nonmodel organisms , 2011, Molecular ecology resources.

[40]  Martin I. Taylor,et al.  Novel Tools for Conservation Genomics: Comparing Two High-Throughput Approaches for SNP Discovery in the Transcriptome of the European Hake , 2011, PloS one.

[41]  Ashish Kumar,et al.  Large-scale development of cost-effective SNP marker assays for diversity assessment and genetic mapping in chickpea and comparative mapping in legumes , 2012, Plant biotechnology journal.

[42]  Sylvie Cloutier,et al.  SNP Discovery through Next-Generation Sequencing and Its Applications , 2012, International journal of plant genomics.

[43]  Guojun Li,et al.  The Impacts of Read Length and Transcriptome Complexity for De Novo Assembly: A Simulation Study , 2014, PloS one.