Indexing Arbitrary-Length k-Mers in Sequencing Reads

We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), is based on finding overlapping reads and is competitive with existing algorithms in space usage, query time, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
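
To make the counting and locating queries concrete, the following is a minimal sketch of a k-mer index built on a plain suffix array over concatenated reads. It is not the PgSA implementation: it skips the overlap-merging step that produces the pseudogenome, uses a naive suffix array construction, and all names (`build_index`, `count_kmer`, `locate_kmer`, the `$` separator) are hypothetical choices made for illustration only.

```python
from bisect import bisect_right

SEP = "$"  # assumed separator; sorts below A, C, G, T and never occurs in a k-mer


def build_index(reads):
    """Concatenate reads with separators and build a plain suffix array.

    Illustration only: the sort below is quadratic-ish and no reads are
    merged by overlap, unlike the pseudogenome described in the abstract.
    """
    text = SEP.join(reads) + SEP
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    starts, pos = [], 0
    for r in reads:
        starts.append(pos)       # offset of each read inside text
        pos += len(r) + 1
    return text, sa, starts


def _sa_range(text, sa, kmer):
    """Return the half-open suffix array interval of suffixes prefixed by kmer."""
    k = len(kmer)
    lo, hi = 0, len(sa)
    while lo < hi:               # first suffix whose k-prefix is >= kmer
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:               # first suffix whose k-prefix is > kmer
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= kmer:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def count_kmer(index, kmer):
    """Number of occurrences of kmer across all reads."""
    text, sa, _ = index
    first, last = _sa_range(text, sa, kmer)
    return last - first


def locate_kmer(index, kmer):
    """Sorted list of (read_id, offset_in_read) pairs for every occurrence."""
    text, sa, starts = index
    first, last = _sa_range(text, sa, kmer)
    hits = []
    for p in sa[first:last]:
        read_id = bisect_right(starts, p) - 1
        hits.append((read_id, p - starts[read_id]))
    return sorted(hits)


index = build_index(["ACGTAC", "GTACGT", "TTTT"])
print(count_kmer(index, "GTAC"))   # 2
print(locate_kmer(index, "GTAC"))  # [(0, 2), (1, 0)]
```

Because the separator is not in the DNA alphabet, a k-mer can never match across a read boundary, and the same index answers queries for any k up to the read length.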
