SNP calling from RNA-seq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence

SNPs (Single Nucleotide Polymorphisms) are genetic markers whose precise identification is a prerequisite for association studies. Methods to identify them are currently well developed for model species, but they rely on the availability of a (good) reference genome and therefore cannot be applied to non-model species. They are also mostly tailored for whole-genome (re-)sequencing experiments, whereas in many cases transcriptome sequencing can be used as a cheaper alternative that already makes it possible to identify SNPs located in transcribed regions. In this paper, we propose a method that identifies, quantifies and annotates SNPs without any reference genome, using RNA-seq data only. Individuals can be pooled prior to sequencing if not enough material is available from a single individual. Using pooled human RNA-seq data, we assess the precision and recall of our method and discuss them relative to other methods that use a reference genome or an assembled transcriptome. We then experimentally validate the predictions of our method using RNA-seq data from two non-model species. The method can be used for any species to annotate SNPs and predict their impact on the protein sequence. We further make it possible to test for the association of the identified SNPs with a phenotype of interest.
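As a reminder of how the precision and recall mentioned above are usually defined for SNP calls, the following is a minimal sketch under stated assumptions, not the evaluation procedure of the paper. The function name `precision_recall`, the representation of a SNP as a (chromosome, position, reference allele, alternative allele) tuple, and the toy data are all hypothetical and chosen for illustration only.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two collections of SNP calls.

    precision = true positives / number of predicted SNPs
    recall    = true positives / number of SNPs in the truth set
    """
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)  # calls present in both sets
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data: each SNP is (chromosome, position, ref, alt).
predicted = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "T", "C"),  # not in the truth set: a false positive
}
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr3", 10, "A", "C"),   # missed by the caller: a false negative
    ("chr3", 20, "G", "T"),   # missed by the caller: a false negative
}

precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four predicted SNPs are in the truth set (precision 0.75), and three of the five true SNPs are recovered (recall 0.60).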
