Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map.

Results: Here we present a method for 'split' read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant.

Availability: SplazerS is available from http://www.seqan.de/projects/splazers.

Supplementary information: Supplementary data are available at Bioinformatics online.
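The 'split' read mapping idea from the abstract (a prefix and a suffix of the read both match the reference, separated by one longer gap) can be illustrated with a toy sketch. This is not the SplazerS algorithm: it assumes exact, error-free matches and brute-force search, and the names `split_map`, `occurrences` and `min_anchor`, as well as the example sequences, are hypothetical and chosen only for illustration.

```python
def occurrences(text, pattern, start=0):
    """Yield every start position of pattern in text at or after start."""
    pos = text.find(pattern, start)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def split_map(read, ref, min_anchor=4):
    """Toy split mapping with exact matches only (illustrative assumption).

    Returns (kind, ref_pos, length), where ref_pos is the reference
    position right after the matched prefix, or None if the read maps
    contiguously or no split explains it.
    """
    n = len(read)
    if read in ref:
        return None  # contiguous match, no indel needed
    # Try every prefix length that leaves min_anchor bases for the suffix.
    for i in range(min_anchor, n - min_anchor + 1):
        for p in occurrences(ref, read[:i]):
            prefix_end = p + i
            # Deletion: the rest of the read matches further downstream,
            # so the reference bases in between are missing from the read.
            s = ref.find(read[i:], prefix_end + 1)
            if s != -1:
                return ("deletion", prefix_end, s - prefix_end)
            # Insertion: a suffix of the read continues directly after the
            # prefix in the reference, so read[i:j] has no counterpart.
            for j in range(i + 1, n - min_anchor + 1):
                if ref.startswith(read[j:], prefix_end):
                    return ("insertion", prefix_end, j - i)
    return None


# Hypothetical 30 bp reference and three reads derived from it.
ref = "TTGACCGATAGGCTCAAGTCCGTAATGCGA"
read_del = ref[2:10] + ref[18:26]          # skips ref[10:18]
read_ins = ref[2:10] + "CCC" + ref[10:18]  # carries 3 extra bases
read_ok = ref[2:18]                        # maps without a gap

print(split_map(read_del, ref))  # ('deletion', 10, 8)
print(split_map(read_ins, ref))  # ('insertion', 10, 3)
print(split_map(read_ok, ref))   # None
```

In this sketch the breakpoint is reported exactly because both anchors must match base for base. A practical tool also has to tolerate sequencing errors and ambiguous breakpoints in repeats, which this example deliberately ignores.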

Martin Vingron | Marcel H. Schulz | Knut Reinert | David Weese | Anne-Katrin Emde | Stefan A. Haas | Ruping Sun | Vera M. Kalscheuer
