Compressed q-Gram Indexing for Highly Repetitive Biological Sequences

The study of compressed storage schemes for highly repetitive sequence collections has been recently boosted by the availability of cheaper sequencing technologies and the flood of data they promise to generate. Such a storage scheme may range from the simple goal of retrieving whole individual sequences to the more advanced one of providing fast searches in the collection. In this paper we study alternatives to implement a particularly popular index, namely, the one able of finding all the positions in the collection of substrings of fixed length ($q$-grams). We introduce two novel techniques and show they constitute practical alternatives to handle this scenario. They excel particularly in two cases: when $q$ is small (up to 6), and when the collection is extremely repetitive (less than 0.01% mutations).

[1]  Jean-Paul Delahaye,et al.  A guaranteed compression scheme for repetitive DNA sequences , 1996, Proceedings of Data Compression Conference - DCC '96.

[2]  Gonzalo Navarro,et al.  Practical Rank/Select Queries over Arbitrary Sequences , 2008, SPIRE.

[3]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[4]  Trevor I. Dix,et al.  A Simple Statistical Algorithm for Biological Sequence Compression , 2007, 2007 Data Compression Conference (DCC'07).

[5]  Hugh E. Williams,et al.  Compressing Integers for Fast File Access , 1999, Comput. J..

[6]  E. Pennisi Breakthrough of the year. Human genetic variation. , 2007, Science.

[7]  William F. Smyth,et al.  Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory , 2006, SPIRE.

[8]  R. Nigel Horspool,et al.  Practical fast searching in strings , 1980, Softw. Pract. Exp..

[9]  Gonzalo Navarro,et al.  Self-indexed Text Compression Using Straight-Line Programs , 2009, MFCS.

[10]  E. Pennisi,et al.  Human Genetic Variation , 2007, Science.

[11]  Giovanni Manzini,et al.  A simple and fast DNA compressor , 2004, Softw. Pract. Exp..

[12]  Ian H. Witten,et al.  Protein is incompressible , 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[13]  Alistair Moffat,et al.  Off-line dictionary-based compression , 2000 .

[14]  George Kingsley Zipf,et al.  Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology , 2012 .

[15]  George M. Church,et al.  Genomes for all. , 2006, Scientific American.

[16]  Gonzalo Navarro,et al.  Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences , 2002 .

[17]  Gonzalo Navarro,et al.  Storage and Retrieval of Individual Genomes , 2009, RECOMB.

[18]  Rodrigo González,et al.  Compressed Text Indexes with Fast Locate , 2007, CPM.

[19]  Rajeev Raman,et al.  Succinct Representations of Permutations , 2003, ICALP.

[20]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[21]  Hiroshi Sakamoto,et al.  A fully linear-time approximation algorithm for grammar-based compression , 2003, J. Discrete Algorithms.

[22]  Neil Hall,et al.  Advanced sequencing technologies and their wider impact in microbiology , 2007, Journal of Experimental Biology.

[23]  Xin Chen,et al.  A compression algorithm for DNA sequences and its applications in genome comparison , 2000, RECOMB '00.

[24]  Stéphane Grumbach,et al.  A New Challenge for Compression Algorithms: Genetic Sequences , 1994, Inf. Process. Manag..

[25]  Akihiko Konagaya,et al.  DNA Data Compression in the Post Genome Era , 2001 .