Paper information - Leveraging known genomic variants to improve detection of variants, especially close-by Indels

Motivation: The detection of genomic variants is of great significance in genomics, bioinformatics, biomedical research and its applications. However, despite considerable effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to detect such variants accurately. Indels, especially those close to each other, nevertheless remain hard to detect accurately.

Results: We introduce a novel approach that leverages known variant information, e.g. as provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve the sensitivity of variant detection, especially for close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm that takes known variant information into account is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low-coverage data. We show that, compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method called these close-by Indels at 15-20% higher sensitivity than other methods at low coverage, and still achieved 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and the ExAC database. Our findings suggest that, by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at low cost.

Availability and implementation: The implementation can be found in our public code repository, https://github.com/namsyvo/IVC.

Supplementary information: Supplementary data are available at Bioinformatics online.
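The meta-reference idea described in the Results can be illustrated with a toy sketch. This is not the IVC implementation: the function names `build_meta_reference` and `explain_read`, the tuple layout of the variants, and the example sequences are all assumptions made for illustration. The sketch only handles exact matches and ignores allele frequencies, base qualities and sequencing errors, all of which a real aligner would need to account for. Variants are assumed to be written VCF-style, with 0-based positions and a shared anchor base for Indels.

```python
def build_meta_reference(reference, known_variants):
    """Index known variants by their 0-based start position on the reference.

    known_variants: iterable of (pos, ref_allele, alt_allele) tuples (hypothetical layout).
    """
    by_pos = {}
    for pos, ref_allele, alt_allele in known_variants:
        # Sanity check: the stated REF allele must agree with the reference.
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at position {pos}")
        by_pos.setdefault(pos, []).append((ref_allele, alt_allele))
    return by_pos


def explain_read(reference, by_pos, read, start):
    """Return the known variants that let `read` match exactly at `start`, or None.

    Depth-first search: at each reference position, either consume the
    reference base or substitute one of the known alternate alleles.
    """
    def walk(i, j, used):
        if j == len(read):          # whole read consumed: success
            return used
        if i >= len(reference):     # ran off the end of the reference
            return None
        # Option 1: the read agrees with the plain reference base.
        if reference[i] == read[j]:
            found = walk(i + 1, j + 1, used)
            if found is not None:
                return found
        # Option 2: the read carries a known alternate allele starting here.
        for ref_allele, alt_allele in by_pos.get(i, []):
            if read.startswith(alt_allele, j):
                found = walk(i + len(ref_allele), j + len(alt_allele),
                             used + [(i, ref_allele, alt_allele)])
                if found is not None:
                    return found
        return None

    return walk(start, 0, [])


reference = "ACGTACGTAC"
# Two close-by known Indels: a 1-base deletion and a 2-base insertion.
by_pos = build_meta_reference(reference, [(2, "GT", "G"), (5, "C", "CTT")])
read = "CGACTTGT"  # read from a subject genome carrying both Indels

print(explain_read(reference, by_pos, read, 1))  # [(2, 'GT', 'G'), (5, 'C', 'CTT')]
print(explain_read(reference, {}, read, 1))      # None
```

Against the plain reference the read has no exact placement at this position, while the variant-aware search recovers both close-by Indels from a single read, which is the intuition behind calling such variants at low coverage.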

Vinhthuy T. Phan | Nam S. Vo
