Paper information - deBGA: read alignment with de Bruijn graph-based seed and extension - 字舞流文

deBGA: read alignment with de Bruijn graph-based seed and extension

MOTIVATION: As high-throughput sequencing (HTS) technology becomes ubiquitous and data volumes continue to rise, HTS read alignment is increasingly rate-limiting, which drives the development of novel read alignment approaches. Moreover, promising new applications of HTS technology require aligning reads to multiple genomes instead of a single reference; however, state-of-the-art aligners cannot yet viably align large numbers of reads to multiple genomes.

RESULTS: We propose de Bruijn Graph-based Aligner (deBGA), an innovative graph-based seed-and-extension algorithm that aligns HTS reads to a reference genome organized and indexed using a de Bruijn graph. Because it handles repeats well, deBGA is substantially faster than state-of-the-art approaches while maintaining similar or higher sensitivity and accuracy. This makes it particularly well suited to the rapidly growing volumes of sequencing data. Furthermore, it provides a promising solution for aligning reads to multiple genomes and graph-based references in HTS applications.

AVAILABILITY AND IMPLEMENTATION: deBGA is available at: https://github.com/hitbc/deBGA

CONTACT: ydwang@hit.edu.cn

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
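To make the idea in RESULTS concrete, the following is a minimal Python sketch of seed-and-extension over a k-mer index, together with the de Bruijn graph edges of a reference. It is an illustration under stated assumptions, not deBGA's actual implementation: the function names (`build_kmer_index`, `de_bruijn_edges`, `seed_and_extend`), the mismatch-only extension, and the `max_mismatches` parameter are all hypothetical and do not come from the abstract.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def de_bruijn_edges(reference, k):
    """Map each k-mer (node) to the set of k-mers that follow it in the reference.

    A repeated k-mer is stored once as a single node, which is how a
    de Bruijn graph collapses repeats.
    """
    edges = defaultdict(set)
    for i in range(len(reference) - k):
        edges[reference[i:i + k]].add(reference[i + 1:i + k + 1])
    return dict(edges)


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (reference_start, mismatches) pairs, best candidates first.

    Seeding: every k-mer of the read is looked up in the index.
    Extension (simplified assumption): an ungapped comparison of the whole
    read against the reference at the implied start position.
    """
    seen = set()
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start in seen:
                continue
            seen.add(start)
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
    return sorted(hits, key=lambda h: (h[1], h[0]))
```

For example, with the toy reference `"ACGTACGTTAGC"` and `k = 4`, the repeated k-mer `ACGT` appears as one node with two successors, and a read taken from the second copy is placed by extending each seed hit and keeping only the candidates within the mismatch budget. A real aligner would use gapped extension and a compact index; this sketch only shows the overall shape of the approach.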

Yadong Wang | Bo Liu | Michael Brudno | Hongzhe Guo

[1] Vipin T. Sreedharan,et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana , 2011, Nature.

[2] J. Zook,et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls , 2013, Nature Biotechnology.

[3] Daniel J. Wilson,et al. Transforming clinical microbiology with bacterial genome sequencing , 2012, Nature Reviews Genetics.

[4] Gabor T. Marth,et al. An integrated map of structural variation in 2,504 human genomes , 2015, Nature.

[5] A. Gnirke,et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data , 2010, Proceedings of the National Academy of Sciences.

[6] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[7] Dawei Li,et al. The diploid genome sequence of an Asian individual , 2008, Nature.

[8] Roderic Guigó,et al. The GEM mapper: fast, accurate and versatile alignment by filtration , 2012, Nature Methods.

[9] Steven L. Salzberg,et al. Re-analysis of metagenomic sequences from acute flaccid myelitis patients reveals alternatives to enterovirus D68 infection , 2015, F1000Research.

[10] S. Rosset,et al. lobSTR: A short tandem repeat profiler for personal genomes , 2012, RECOMB.

[11] Adam M. Novak,et al. Mapping to a Reference Genome Structure , 2014, 1404.5010.

[12] Veli Mäkinen,et al. Indexing Graphs for Path Queries with Applications in Genome Research , 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[13] Gil McVean,et al. Improved genome inference in the MHC using a population reference graph , 2014, Nature Genetics.

[14] Wing Hung Wong,et al. Fast and accurate read alignment for resequencing , 2012, Bioinform..

[15] M. DePristo,et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[16] Thomas Zichner,et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis , 2012, Bioinform..

[17] Steven L Salzberg,et al. Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[18] N. Warthmann,et al. Simultaneous alignment of short reads against multiple genomes , 2009, Genome Biology.

[19] S. Salzberg,et al. Repetitive DNA and next-generation sequencing: computational challenges and solutions , 2012, Nature Reviews Genetics.

[20] R. Durbin,et al. Fast and accurate short read alignment with Burrows–Wheeler transform , 2009, Bioinform..

[21] David A. Rasko,et al. Bacterial genome sequencing in the clinic: bioinformatic challenges and solutions , 2013, Nature Reviews Genetics.

[22] Lin Huang,et al. Short read alignment with populations of genomes , 2013, Bioinform..

[23] Veli Mäkinen,et al. On enhancing variation detection through pan-genome indexing , 2015, bioRxiv.

[24] Elizabeth M. Smigielski,et al. dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[25] Richard Durbin,et al. Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[26] Knut Reinert,et al. SeqAn An efficient, generic C++ library for sequence analysis , 2008, BMC Bioinformatics.

[27] H. Tettelin,et al. The microbial pan-genome. , 2005, Current opinion in genetics & development.

[28] Melissa J. Landrum,et al. RefSeq: an update on mammalian reference sequences , 2013, Nucleic Acids Res..

[29] Heng Li,et al. A survey of sequence alignment algorithms for next-generation sequencing , 2010, Briefings Bioinform..

[30] Gad M. Landau,et al. Introducing efficient parallelism into approximate string matching and a new serial algorithm , 1986, STOC '86.

[31] Kenny Q. Ye,et al. An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[32] Knut Reinert,et al. Genome alignment with graph data structures: a comparison , 2014, BMC Bioinformatics.

[33] Richard Durbin,et al. Fast and accurate short read alignment with Burrows–Wheeler transform , 2009, Bioinform..

[34] Thomas R. Gingeras,et al. STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[35] Benjamin J. Raphael,et al. A novel method for multiple alignment of sequences with repeated and shuffled elements. , 2004, Genome research.