QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings

Q-gram models (also known as n-gram or k-mer models) are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least an index structure that needs roughly N times the storage of a single-string index structure. For highly-similar strings, redundancies can be identified that do not need to be stored repeatedly; for instance, two human genomes are more than 99 percent similar. In this work, we propose QGramProjector, a new way of indexing many highly-similar strings. To remove the redundancies caused by similarities, our proposal is to (1) create all q-grams for a fixed reference, (2) referentially compress all strings in the collection with respect to the reference, and then (3) project all q-grams from the reference to the compressed strings. Experiments show that a complete index can be relatively small compared to the collection of highly-similar strings. For a collection of 1092 human genomes (raw data size of 3 TB), a 16-gram index structure, which can be used for instance as a basis for multi-genome read alignment, needs only 100.5 GB (a compression ratio of 31:1). We think that our work is an important step towards the analysis of large sets of highly-similar genomes on commodity hardware.
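The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the greedy matching strategy, the entry format, and the separate "boundary" index for q-grams that are not fully covered by a single reference match are all assumptions made for illustration. The sketch also keeps the uncompressed target in memory, which a real system would avoid.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Step 1: positional q-gram index, mapping each q-gram to its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def compress_against_reference(ref, target, ref_index, q):
    """Step 2 (hypothetical greedy scheme): encode target as a list of
    ("match", target_pos, ref_pos, length) and ("literal", target_pos, char)."""
    entries = []
    t = 0
    while t < len(target):
        best_pos, best_len = -1, 0
        # Candidate reference positions share the next q characters with the target.
        for r in ref_index.get(target[t:t + q], []):
            length = 0
            while (r + length < len(ref) and t + length < len(target)
                   and ref[r + length] == target[t + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = r, length
        if best_len >= q:
            entries.append(("match", t, best_pos, best_len))
            t += best_len
        else:
            entries.append(("literal", t, target[t]))
            t += 1
    return entries


def build_boundary_index(target, entries, q):
    """Index only those target q-grams that do not lie inside a single match."""
    covered = set()
    for entry in entries:
        if entry[0] == "match":
            _, t_pos, _, length = entry
            covered.update(range(t_pos, t_pos + length - q + 1))
    index = defaultdict(list)
    for i in range(len(target) - q + 1):
        if i not in covered:
            index[target[i:i + q]].append(i)
    return index


def project_qgram(qgram, ref_index, entries, boundary_index):
    """Step 3: translate reference occurrences into target positions."""
    q = len(qgram)
    hits = []
    for r in ref_index.get(qgram, []):
        for entry in entries:
            if entry[0] != "match":
                continue
            _, t_pos, ref_pos, length = entry
            # The occurrence is projected only if it lies fully inside the match.
            if ref_pos <= r and r + q <= ref_pos + length:
                hits.append(t_pos + (r - ref_pos))
    hits.extend(boundary_index.get(qgram, []))
    return sorted(hits)


if __name__ == "__main__":
    ref = "ACGTACGTTGCAACGT"
    target = "ACGTACCTTGCAACGT"  # one substitution relative to ref
    q = 3
    ref_index = build_qgram_index(ref, q)
    entries = compress_against_reference(ref, target, ref_index, q)
    boundary = build_boundary_index(target, entries, q)
    print(entries)
    print(project_qgram("ACG", ref_index, entries, boundary))
```

In this sketch, the per-string storage is the entry list plus the boundary index, both of which shrink as the string becomes more similar to the reference; the reference index is stored once and shared by the whole collection.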
