iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data

Background
Structural variations (SVs), such as insertions, deletions, inversions, and duplications, are a common feature of human genomes, and a number of studies have reported that such SVs are associated with human diseases. Although progress in next generation sequencing (NGS) technologies has led to the discovery of a large number of SVs, accurate genome-wide detection of SVs remains challenging. Various calling algorithms based on NGS data have been proposed so far. However, their strategies are diverse, and no single tool detects the full range of SVs accurately.

Results
We evaluated the performance of existing deletion calling algorithms across a range of deletion sizes, using simulation data from low to high coverage. The simulation data were generated from a whole genome sequence with artificial SVs constructed from the distribution of variants obtained from the 1000 Genomes Project. In the simulation analysis, each caller produced deletion calls of various sizes, and performance differed considerably depending on the type of algorithm and the target deletion size. Based on these results, we propose an integrated structural variant calling pipeline (iSVP) that combines existing methods with newly devised filtering and merging processes. It achieved highly accurate deletion calling, with >90% precision and >90% recall on the 30× read data across a broad range of sizes. We applied iSVP to the whole-genome sequence data of a CEU HapMap sample and detected a large number of deletions, including notable peaks around 300 bp and 6,000 bp, which correspond to Alus and long interspersed nuclear elements, respectively. In addition, many of the predicted deletions were highly consistent with deletions experimentally validated in other studies.

Conclusions
We present iSVP, a new deletion calling pipeline that provides a highly accurate genome-wide landscape of deletions. Using simulation and real data analysis, we show that iSVP is broadly applicable to human whole-genome sequencing data, which will help elucidate relationships between SVs across genomes and associated diseases or biological functions.
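
The precision and recall figures quoted above can be illustrated with a minimal sketch of how deletion calls might be scored against a simulated truth set. This is not the iSVP evaluation code: the 50% reciprocal-overlap matching rule, the tuple layout, and the function names `reciprocal_overlap` and `precision_recall` are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open coordinates.
Deletion = Tuple[str, int, int]


def reciprocal_overlap(a: Deletion, b: Deletion, min_frac: float = 0.5) -> bool:
    """True if a and b overlap by at least min_frac of BOTH intervals.

    The 50% default is an assumed matching criterion, not taken from the paper.
    """
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a[2] - a[1])
            and overlap >= min_frac * (b[2] - b[1]))


def precision_recall(calls: List[Deletion], truth: List[Deletion],
                     min_frac: float = 0.5) -> Tuple[float, float]:
    """Precision: fraction of calls matching some true deletion.
    Recall: fraction of true deletions matched by some call."""
    matched_calls = sum(
        any(reciprocal_overlap(c, t, min_frac) for t in truth) for c in calls
    )
    matched_truth = sum(
        any(reciprocal_overlap(t, c, min_frac) for c in calls) for t in truth
    )
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall


# Toy data (made up for this example): one Alu-sized and one LINE-sized deletion.
truth = [("chr1", 100, 400), ("chr1", 1000, 7000), ("chr2", 500, 800)]
calls = [("chr1", 110, 410), ("chr1", 1200, 7100),
         ("chr3", 0, 300), ("chr2", 5000, 5300)]
precision, recall = precision_recall(calls, truth)
```

With this toy data, two of the four calls match a true deletion and two of the three true deletions are recovered. A stricter or looser overlap threshold would change which calls count as matches, so any reported figures depend on the matching rule used.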
