Storage and Retrieval of Highly Repetitive Sequence Collections

A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are able of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.

[1]  E. Pennisi Breakthrough of the year. Human genetic variation. , 2007, Science.

[2]  Rodrigo González,et al.  Improved Dynamic Rank-Select Entropy-Bound Structures , 2008, LATIN.

[3]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[4]  Roberto Grossi,et al.  Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract) , 2000, STOC '00.

[5]  Wing-Kai Hon,et al.  Compressed data structures: dictionaries and data-aware measures , 2006, Data Compression Conference (DCC'06).

[6]  E. Pennisi,et al.  Human Genetic Variation , 2007, Science.

[7]  Siu-Ming Yiu,et al.  A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays , 2002, COCOON.

[8]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[9]  Siu-Ming Yiu,et al.  Compressed indexing and local alignment of DNA , 2008, Bioinform..

[10]  Stéphane Grumbach,et al.  A New Challenge for Compression Algorithms: Genetic Sequences , 1994, Inf. Process. Manag..

[11]  Jouni Sirén,et al.  Compressed Suffix Arrays for Massive Data , 2009, SPIRE.

[12]  Guy E. Blelloch,et al.  Compact representations of ordered sets , 2004, SODA '04.

[13]  Gonzalo Navarro,et al.  Fully compressed suffix trees , 2008, TALG.

[14]  Raffaele Giancarlo,et al.  Textual data compression in computational biology: a synopsis , 2009, Bioinform..

[15]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[16]  L. J. Korn,et al.  New approaches for computer analysis of nucleic acid sequences. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Peter Elias,et al.  Universal codeword sets and representations of the integers , 1975, IEEE Trans. Inf. Theory.

[18]  C. SIAMJ. LOW REDUNDANCY IN STATIC DICTIONARIES WITH CONSTANT QUERY TIME , 2001 .

[19]  George M. Church,et al.  Genomes for all. , 2006, Scientific American.

[20]  Thomas M. Cover,et al.  Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing) , 2006 .

[21]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[22]  Gonzalo Navarro,et al.  An(other) Entropy-Bounded Compressed Suffix Tree , 2008, CPM.

[23]  Gonzalo Navarro,et al.  Dynamic Fully-Compressed Suffix Trees , 2008, CPM.

[24]  Giovanni Manzini,et al.  An analysis of the Burrows-Wheeler transform , 2001, SODA '99.

[25]  Wing-Kai Hon,et al.  Compressed indexes for dynamic text collections , 2007, TALG.

[26]  Gonzalo Navarro,et al.  Dynamic entropy-compressed sequences and full-text indexes , 2006, TALG.

[27]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[28]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[29]  Gonzalo Navarro,et al.  Succinct Suffix Arrays based on Run-Length Encoding , 2005, Nord. J. Comput..

[30]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[31]  Gonzalo Navarro,et al.  Advantages of Backward Searching - Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays , 2004, ISAAC.

[32]  Gonzalo Navarro,et al.  Compressed representations of sequences and full-text indexes , 2007, TALG.

[33]  Z. Galil,et al.  Combinatorial Algorithms on Words , 1985 .

[34]  Alberto Apostolico,et al.  The Myriad Virtues of Subword Trees , 1985 .

[35]  Stéphane Grumbach,et al.  Compression of DNA sequences , 1993, [Proceedings] DCC `93: Data Compression Conference.

[36]  Kunsoo Park,et al.  Dynamic rank/select structures with applications to run-length encoded texts , 2009, Theor. Comput. Sci..

[37]  Kunihiko Sadakane,et al.  New text indexing functionalities of the compressed suffix arrays , 2003, J. Algorithms.

[38]  Sartaj Sahni,et al.  Handbook of Data Structures and Applications , 2004 .

[39]  Roberto Grossi,et al.  High-order entropy-compressed text indexes , 2003, SODA '03.

[40]  Rajeev Raman,et al.  Succinct indexable dictionaries with applications to encoding k-ary trees and multisets , 2002, SODA '02.

[41]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[42]  Giovanni Manzini,et al.  Indexing compressed text , 2005, JACM.

[43]  Szymon Grabowski,et al.  Robust relative compression of genomes with random access , 2011, Bioinform..

[44]  Neil Hall,et al.  Advanced sequencing technologies and their wider impact in microbiology , 2007, Journal of Experimental Biology.

[45]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[46]  M. H. Overmars,et al.  Searching in the past I , 1981 .

[47]  Gonzalo Navarro,et al.  Implicit Compression Boosting with Applications to Self-indexing , 2007, SPIRE.

[48]  Kunihiko Sadakane,et al.  Compressed Suffix Trees with Full Functionality , 2007, Theory of Computing Systems.