Seeksv: an accurate tool for somatic structural variation and virus integration detection

Motivation: Many forms of variation exist in the human genome, including single nucleotide polymorphisms, small insertions/deletions (indels) and structural variations (SVs). Somatically acquired SVs may regulate the expression of tumor-related genes and result in cell proliferation and uncontrolled growth, eventually inducing tumor formation. Integration of viral sequence into the host genome is a type of SV that destabilizes the affected genes and causes normal cells to transform into tumor cells. Cancer SVs and viral integration sites must be discovered on a genome-wide scale to clarify the mechanisms of tumor occurrence and development.

Results: In this paper, we propose a new tool called seeksv to detect somatic SVs and viral integration events. Seeksv simultaneously uses four signals: split reads, discordant paired-end reads, read depth, and fragments with both ends unmapped. Seeksv can detect deletions, insertions, inversions and inter-chromosomal translocations at single-nucleotide resolution. It accommodates different types of sequencing data, such as single-end or paired-end data, for SV detection. Seeksv also includes a rescue model for SVs whose breakpoints lie in regions of sequence homology. Results on simulated data and on real data from the 1000 Genomes Project and from esophageal squamous cell carcinoma samples show that seeksv has higher efficiency and precision than other similar software in detecting SVs. For the discovery of hepatitis B virus integration sites from probe capture data, validation experiments show that more than 90% of the viral integration sequences detected by seeksv are true.

Availability and Implementation: seeksv is implemented in C++ and can be downloaded from https://github.com/qkl871118/seeksv.

Contact: dragonbw@163.com

Supplementary information: Supplementary data are available at Bioinformatics online.
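To make the split-read signal mentioned in the abstract concrete, the following is a minimal Python sketch of how soft-clipped alignments can point to a candidate breakpoint at single-nucleotide resolution. It is an illustration under stated assumptions, not seeksv's actual implementation (which is written in C++ and combines several signals): the function names `clip_breakpoints` and `cluster_breakpoints`, the `min_support` threshold, and the toy reads are all hypothetical. Positions are 1-based leftmost mapped coordinates, and CIGAR strings follow the SAM format.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_breakpoints(pos, cigar):
    """Return candidate breakpoints implied by soft clips in one alignment.

    pos   -- 1-based leftmost mapped position of the read
    cigar -- CIGAR string of the alignment
    Result is a list of (side, coordinate) tuples, where side is "left" or
    "right" and coordinate is the last aligned reference base before the clip.
    """
    # Hard clips carry no sequence, so ignore them when looking at read ends.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return []
    breakpoints = []
    # A leading soft clip means the read stops matching at its first aligned base.
    if ops[0][1] == "S":
        breakpoints.append(("left", pos))
    # Reference span covers only operations that consume the reference.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    # A trailing soft clip means the read stops matching at its last aligned base.
    if len(ops) > 1 and ops[-1][1] == "S":
        breakpoints.append(("right", pos + ref_len - 1))
    return breakpoints


def cluster_breakpoints(reads, min_support=2):
    """Count clipped reads per (chrom, side, coordinate); keep supported sites.

    reads       -- iterable of (chrom, pos, cigar) tuples
    min_support -- hypothetical threshold on the number of supporting reads
    """
    counts = Counter()
    for chrom, pos, cigar in reads:
        for side, coord in clip_breakpoints(pos, cigar):
            counts[(chrom, side, coord)] += 1
    return {site: n for site, n in counts.items() if n >= min_support}


# Toy alignments (hypothetical): two reads are clipped at the same base.
reads = [
    ("chr1", 100, "60M40S"),
    ("chr1", 120, "40M60S"),
    ("chr1", 160, "30S70M"),
    ("chr2", 500, "100M"),
]
candidates = cluster_breakpoints(reads)
print(candidates)
```

In this toy input, the first two reads both end their aligned portion at chr1:159, so that site passes the support threshold, while the single left-clipped read at chr1:160 and the fully aligned read on chr2 do not. A real caller would additionally realign the clipped sequence to find the partner breakpoint and cross-check with discordant pairs and read depth.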
