SVEM: A Structural Variant Estimation Method Using Multi-mapped Reads on Breakpoints

Recent advances in next-generation sequencing (NGS) technologies have enabled the identification of structural variants (SVs) in the genomic DNA of the human population. Several SV detection methods that use NGS data have been proposed. However, NGS data remain difficult to analyze, particularly when reads originate from duplicated loci or low-complexity sequences of the human genome, because such reads map ambiguously to multiple positions. In this paper, we propose SVEM, a novel statistical method that detects SVs at single-nucleotide resolution and can use multi-mapped reads on breakpoints. SVEM treats the number of reads on each breakpoint as a parameter and the mapping state of each read as a latent variable, and estimates both with the expectation-maximization (EM) algorithm. This framework handles ambiguously mapped reads without discarding information that is useful for SV detection. We apply SVEM to simulated and real data, where it achieves better precision and recall than existing methods.
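The EM idea described in the abstract can be illustrated with a minimal sketch. This is not the SVEM model itself, whose exact likelihood is not given here; it assumes a simple mixture in which each read has a set of candidate breakpoints with precomputed alignment likelihoods, and the parameter `theta[j]` is the fraction of reads originating from breakpoint `j`. The function name `em_breakpoint_abundance` and the input layout are hypothetical and chosen only for illustration.

```python
def em_breakpoint_abundance(read_likelihoods, n_breakpoints, n_iter=200, tol=1e-9):
    """Estimate per-breakpoint read fractions from ambiguously mapped reads.

    read_likelihoods: list with one dict per read, mapping a candidate
        breakpoint index to the (assumed, precomputed) alignment likelihood
        of that read at that breakpoint.
    Returns a list theta of length n_breakpoints that sums to 1.
    """
    # Start from a uniform guess over breakpoints.
    theta = [1.0 / n_breakpoints] * n_breakpoints
    for _ in range(n_iter):
        expected_counts = [0.0] * n_breakpoints
        # E-step: posterior probability of each mapping state (latent variable).
        for candidates in read_likelihoods:
            weights = {j: theta[j] * lik for j, lik in candidates.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue  # read carries no information under the current theta
            for j, w in weights.items():
                expected_counts[j] += w / total
        n_assigned = sum(expected_counts)
        if n_assigned == 0.0:
            raise ValueError("no read has a positive likelihood at any breakpoint")
        # M-step: re-estimate the fractions from the expected read counts.
        new_theta = [c / n_assigned for c in expected_counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


# Toy input: 3 reads unique to breakpoint 0, 1 read unique to breakpoint 1,
# and 4 reads that map equally well to both breakpoints.
reads = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 1.0, 1: 1.0}] * 4
theta = em_breakpoint_abundance(reads, n_breakpoints=2)
```

In this toy input the multi-mapped reads are not discarded; they are shared between the two breakpoints in proportion to the evidence from uniquely mapped reads, so the estimate converges to roughly 0.75 and 0.25.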
