Substring compression problems

We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is either to quickly find the compressed representation or the compressed size of any query substring of S (the Substring Compression Query, or SCQ, problem), or to find the length-l substring of S that compresses the least (the Least Compressible Substring, or LCS, problem).

Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly it has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. Typically, however, the compressibility of an entire sequence is less informative than that of portions of the sequence. Substring compressibility may therefore be a more suitable basis for sequence analysis.

We present the first known, nearly optimal algorithms for substring compression problems (SCQ, LCS, and their generalizations) that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees, and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.
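To make the two problem statements concrete, the following is a minimal brute-force sketch in Python. It is not the suffix-tree method of this paper: it reparses every substring from scratch and serves only as a reference baseline. It assumes that "compressed size" is measured by the number of phrases in a greedy LZ77-style parse with overlapping matches allowed. The function names `lz77_phrase_count`, `scq`, and `least_compressible_substring` are hypothetical and introduced here for illustration.

```python
from typing import Tuple


def lz77_phrase_count(t: str) -> int:
    """Count phrases in a greedy LZ77-style parse of t (naive baseline).

    Assumption: each phrase is the longest match starting at an earlier
    position of t (overlap allowed), or a single fresh character.
    """
    n = len(t)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        # Try every earlier start position j < i.
        for j in range(i):
            k = 0
            while i + k < n and t[j + k] == t[i + k]:
                k += 1
            longest = max(longest, k)
        # Advance by the match length, or by one fresh character.
        i += max(longest, 1)
        phrases += 1
    return phrases


def scq(s: str, i: int, j: int) -> int:
    """Substring Compression Query: phrase count of s[i:j], no preprocessing."""
    return lz77_phrase_count(s[i:j])


def least_compressible_substring(s: str, l: int) -> Tuple[int, int]:
    """Return (start, phrase_count) of a length-l window with the most phrases.

    Ties are broken in favor of the leftmost window.
    """
    if l <= 0 or l > len(s):
        raise ValueError("l must satisfy 1 <= l <= len(s)")
    best_start, best_count = 0, -1
    for start in range(len(s) - l + 1):
        count = scq(s, start, start + l)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count
```

For example, `scq("aaaabcd", 0, 4)` parses "aaaa" into the phrases "a" and "aaa", while `least_compressible_substring("aaaabcd", 3)` scans all five windows of length 3. Every query here costs time polynomial in the window length, which is the cost that preprocessing S is meant to avoid.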
