Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

MOTIVATION: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map.

RESULTS: Here we present a method for 'split' read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant.

AVAILABILITY: SplazerS is available from http://www.seqan.de/projects/splazers.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
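The split-mapping idea in the RESULTS paragraph can be illustrated with a toy sketch. This is not the SplazerS implementation: the function name `split_map_deletion` and the parameters `min_anchor` and `max_gap` are hypothetical, exact string matching stands in for the error-tolerant matching the abstract describes, and only deletions (a gap in the read relative to the reference) are handled.

```python
def split_map_deletion(read, ref, min_anchor=4, max_gap=1000):
    """Toy split mapping for deletions (illustrative only, not SplazerS).

    Tries every split of `read` into a prefix and a suffix of at least
    `min_anchor` bases each, and reports placements where both parts match
    `ref` exactly with 1..max_gap reference bases skipped between them.
    Returns tuples (prefix_start, left_breakpoint, right_breakpoint, gap_len).
    """
    hits = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        p = ref.find(prefix)
        while p != -1:
            prefix_end = p + split  # first reference base after the prefix match
            # Require at least one skipped base so that ungapped matches are ignored.
            s = ref.find(suffix, prefix_end + 1)
            while s != -1 and s - prefix_end <= max_gap:
                hits.append((p, prefix_end, s, s - prefix_end))
                s = ref.find(suffix, s + 1)
            p = ref.find(prefix, p + 1)
    return hits


# Hypothetical example: the read spans a 10 bp deletion in the reference.
left, deleted, right = "ACGTTGCA", "GGATCCGGAT", "TCAGTACG"
ref = left + deleted + right
read = left + right
hits = split_map_deletion(read, ref)
print(hits)
```

In this constructed example the breakpoint is unambiguous because the flanking bases share no sequence with the ends of the deleted segment. When they do, several splits describe the same event and a real tool has to choose one representative breakpoint.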
