Structural variation detection using next-generation sequencing data: A comparative technical review.

Structural variations (SVs) are mutations in the genome that span at least fifty nucleotides. They contribute to phenotypic differences among healthy individuals and can cause severe diseases, including cancers, by breaking or linking genes. It is therefore crucial to profile SVs in the genome systematically. In the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed, driven by the significant cost reduction of NGS experiments and by their ability to detect SVs in an unbiased manner at base-pair resolution. These methods vary in both sensitivity and specificity because they rely on different features that depend on the properties of the SV and of the sequencing library. As a result, predictions from different SV callers are often inconsistent. In addition, noise in the data (both platform-specific sequencing errors and artificial chimeric reads) reduces the specificity of SV detection. Poorly characterized regions of the human genome (e.g., repeat regions) strongly affect read mapping and, in turn, SV calling accuracy. Calling complex SVs requires specialized SV callers. Apart from accuracy, the processing speed of an SV caller is another factor that determines its usability. Knowing the pros and cons of different SV calling techniques, together with the objectives of the biological study, is essential for biologists and bioinformaticians to make informed decisions. This paper describes the different components of the SV calling pipeline and reviews the techniques used by existing SV callers. Through a simulation study, we also demonstrate that library properties, especially insert size, greatly impact the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for a specific biological study.
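
To illustrate why insert size can matter for sensitivity, the following is a minimal sketch of the generic discordant read-pair idea, not the method of any particular SV caller or of the simulation study. The names `Library`, `is_discordant`, and `min_detectable_deletion`, the cutoff of `k = 3` standard deviations, and the example library values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Paired-end library summarised by its insert-size distribution (bp)."""
    mean_insert: float
    sd_insert: float


def is_discordant(mapped_span: float, lib: Library, k: float = 3.0) -> bool:
    """Flag a pair whose mapped span deviates more than k SDs from the mean.

    The cutoff k is a hypothetical choice, not one taken from the paper.
    """
    return abs(mapped_span - lib.mean_insert) > k * lib.sd_insert


def min_detectable_deletion(lib: Library, k: float = 3.0) -> float:
    """Deletion length above which a typical spanning pair becomes discordant.

    A pair with true insert size equal to the library mean that spans a
    deletion of length d maps with span mean + d, so it is flagged only
    when d > k * sd.
    """
    return k * lib.sd_insert


# Two hypothetical libraries: a tight short-insert one and a wide long-insert one.
tight = Library(mean_insert=300.0, sd_insert=30.0)
wide = Library(mean_insert=3000.0, sd_insert=300.0)

deletion = 200.0  # bp
tight_flagged = is_discordant(tight.mean_insert + deletion, tight)
wide_flagged = is_discordant(wide.mean_insert + deletion, wide)
```

Under these assumed values, a typical pair spanning the 200 bp deletion is flagged in the tight library but not in the wide one, because the discordance threshold scales with the spread of the insert-size distribution.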
