Leveraging known genomic variants to improve detection of variants, especially close-by Indels

Motivation: The detection of genomic variants is of great significance in genomics, bioinformatics, biomedical research and their applications. Despite considerable effort, however, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to detect such variants accurately. Even then, Indels, especially those close to each other, remain hard to detect accurately.

Results: We introduce a novel approach that leverages known variant information, e.g. as provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve the sensitivity of variant detection, especially for close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. We developed an alignment algorithm that takes known variant information into account to align reads accurately to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low-coverage data. We show that, compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method called these close-by Indels at 15-20% higher sensitivity than other methods at low coverage and at 1-5% higher sensitivity at high coverage, with competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and the ExAC database. Our findings suggest that, by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.

Availability and implementation: The implementation is available in our public code repository, https://github.com/namsyvo/IVC.

Supplementary information: Supplementary data are available at Bioinformatics online.
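
To make the meta-reference idea concrete, the following is a minimal sketch and not the IVC implementation. It assumes a meta-reference can be represented as the reference sequence plus, at each known variant site, a table of candidate alleles with prior probabilities taken from population allele frequencies. The function name `build_meta_reference`, the tuple layout of `known_variants`, and the 0-based coordinates are all hypothetical choices made for illustration.

```python
def build_meta_reference(reference, known_variants):
    """Combine a reference string with known variants into allele-prior tables.

    known_variants: iterable of (pos, ref_allele, alt_allele, alt_freq),
    with 0-based pos (an assumption of this sketch, not of any file format).
    Returns a dict mapping (pos, ref_allele) to {allele: prior probability}.
    """
    sites = {}
    for pos, ref_allele, alt_allele, alt_freq in known_variants:
        # Reject variants whose stated reference allele does not match the genome.
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference mismatch at position {pos}")
        # Start each site with all prior mass on the reference allele.
        alleles = sites.setdefault((pos, ref_allele), {ref_allele: 1.0})
        # Move prior mass from the reference allele to the alternate allele.
        alleles[alt_allele] = alleles.get(alt_allele, 0.0) + alt_freq
        alleles[ref_allele] -= alt_freq
        if alleles[ref_allele] < 0:
            raise ValueError(f"allele frequencies exceed 1 at position {pos}")
    return sites


# Toy example: one SNP (G>T) and one 1-bp deletion (AC>A) close to each other.
reference = "ACGTACGT"
known_variants = [
    (2, "G", "T", 0.25),
    (4, "AC", "A", 0.5),
]
meta = build_meta_reference(reference, known_variants)
```

An aligner that is aware of such tables could, in principle, score a read allele at a known site by its prior rather than treating every difference from the reference as an equally costly mismatch or gap, which is one way to read the "probabilistically closer" claim above.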
