Paper information - A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities, allowing for a new parallel computational paradigm.

Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
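The counting task described in the abstract can be illustrated with a minimal, single-threaded sketch. This is not Jellyfish's implementation: Jellyfish is a multithreaded C++ program built on a lock-free hash table, whereas the sketch below uses Python's `collections.Counter` as a stand-in. It also assumes a 2-bit-per-base encoding (A, C, G, T mapped to 0 to 3), under which a 31-mer occupies 62 bits and so fits in one 64-bit word. The names `count_kmers`, `unpack` and `BASE_CODE` are hypothetical and chosen for illustration only.

```python
from collections import Counter

# Assumed 2-bit code per base (illustrative mapping, not taken from the paper).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every k-mer in `sequence`, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the low 2k bits of the window
    counts = Counter()
    packed = 0  # rolling 2-bit encoding of the last (up to) k bases
    valid = 0   # number of consecutive A/C/G/T bases seen so far
    for base in sequence.upper():
        code = BASE_CODE.get(base)
        if code is None:
            # Any other symbol (for example N) restarts the window.
            packed = 0
            valid = 0
            continue
        # Shift in the new base and drop the base that left the window.
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[packed] += 1
    return counts


def unpack(packed: int, k: int) -> str:
    """Decode a packed integer back into its k-mer string."""
    return "".join("ACGT"[(packed >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

For example, `count_kmers("ACGTACGT", 3)` holds four distinct keys, which `unpack` decodes to ACG, CGT, GTA and TAC. In a parallel setting the `counts[packed] += 1` step is where threads would contend, which is the update a lock-free table has to make safe without locks.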

Guillaume Marçais | Carl Kingsford
