A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

MOTIVATION: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities, allowing for a new parallel computational paradigm.

RESULTS: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

AVAILABILITY: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
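To make the idea of a lock-free hash table for k-mer counting concrete, here is a minimal C++ sketch. It is not the Jellyfish implementation, and every name in it (`encode_kmer`, `KmerCounter`, `count_range`, `mix`) is hypothetical. It assumes a 2-bit-per-base encoding, under which a k-mer of up to 31 bases occupies at most 62 bits of a 64-bit word; that is one plausible reading of the 31-base limit, not something the abstract states. It also assumes a fixed-capacity table with linear probing, where a slot is claimed with a single compare-and-swap and counts are updated with an atomic add, so no thread ever takes a lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: with k <= 31 a key uses at most 62 bits, so the all-ones
// 64-bit value can never be a real k-mer and can mark an empty slot.
constexpr std::uint64_t kEmpty = ~std::uint64_t{0};

// Pack the k-mer starting at `pos` into `out`, 2 bits per base (A=0,C=1,G=2,T=3).
// Returns false for k outside 1..31, out-of-range positions, or non-ACGT bases.
bool encode_kmer(const std::string& s, std::size_t pos, unsigned k, std::uint64_t& out) {
    if (k == 0 || k > 31 || pos > s.size() || s.size() - pos < k) return false;
    std::uint64_t v = 0;
    for (unsigned i = 0; i < k; ++i) {
        std::uint64_t code;
        switch (s[pos + i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: return false;
        }
        v = (v << 2) | code;
    }
    out = v;
    return true;
}

// Fixed-size, open-addressing hash table; capacity must be a power of two.
class KmerCounter {
public:
    explicit KmerCounter(std::size_t capacity)
        : mask_(capacity - 1), keys_(capacity), counts_(capacity) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
        for (auto& k : keys_) k.store(kEmpty, std::memory_order_relaxed);
        for (auto& c : counts_) c.store(0, std::memory_order_relaxed);
    }

    // Increment the count of `key`. Safe to call from many threads at once.
    // Returns false only if the table is full.
    bool add(std::uint64_t key) {
        std::size_t i = mix(key) & mask_;
        for (std::size_t probes = 0; probes <= mask_; ++probes, i = (i + 1) & mask_) {
            std::uint64_t cur = keys_[i].load(std::memory_order_acquire);
            if (cur == kEmpty) {
                // Try to claim the empty slot. On failure `cur` is updated to
                // the key another thread stored there first.
                if (keys_[i].compare_exchange_strong(cur, key,
                        std::memory_order_acq_rel, std::memory_order_acquire)) {
                    cur = key;
                }
            }
            if (cur == key) {
                counts_[i].fetch_add(1, std::memory_order_relaxed);
                return true;
            }
        }
        return false;
    }

    // Current count of `key`, or 0 if it was never added.
    std::uint64_t count(std::uint64_t key) const {
        std::size_t i = mix(key) & mask_;
        for (std::size_t probes = 0; probes <= mask_; ++probes, i = (i + 1) & mask_) {
            const std::uint64_t cur = keys_[i].load(std::memory_order_acquire);
            if (cur == key) return counts_[i].load(std::memory_order_relaxed);
            if (cur == kEmpty) return 0;
        }
        return 0;
    }

private:
    // Arbitrary 64-bit mixing function, chosen only for illustration.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    std::size_t mask_;
    std::vector<std::atomic<std::uint64_t>> keys_;
    std::vector<std::atomic<std::uint64_t>> counts_;
};

// Count every k-mer whose start position lies in [begin, end).
// Returns false if the table filled up. K-mers with non-ACGT bases are skipped.
bool count_range(const std::string& seq, unsigned k, KmerCounter& table,
                 std::size_t begin, std::size_t end) {
    std::uint64_t key;
    for (std::size_t p = begin; p < end; ++p)
        if (encode_kmer(seq, p, k, key) && !table.add(key)) return false;
    return true;
}
```

In a parallel setting, several worker threads could each call `count_range` on a disjoint range of start positions against the same `KmerCounter`; the compare-and-swap in `add` ensures that two threads inserting the same k-mer end up sharing one slot. The sketch leaves out resizing, compact storage and reverse-complement handling, all of which a production counter would need to address.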
