Bridging Lossy and Lossless Compression by Motif Pattern Discovery

Abstract: We present data compression techniques hinged on the notion of a motif, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. This notion arose originally in the analysis of sequences, particularly biomolecules, because of its multiple implications for the understanding of biological structure and function, and it has been the subject of various characterizations and studies. Correspondingly, motif discovery techniques and tools have been devised. The task is made hard by the fact that the number of motifs identifiable in a sequence can, in general, be exponential in the size of that sequence. A significant reduction in the number of motifs is achieved through the introduction of irredundant motifs, which, intuitively, are motifs whose structure and list of occurrences cannot be inferred from a combination of other motifs' occurrences. Although suboptimal, the available procedures for extracting some such motifs are not prohibitively expensive. Here we show that irredundant motifs can be usefully exploited in lossy compression methods based on textual substitution and suitable for signals as well as text. Moreover, once the motifs in our lossy encodings are disambiguated into corresponding lossless codebooks, they still prove capable of yielding savings over popular methods in use. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performance that improves over popular methods (i.e., GZip) by more than 20% in lossy and 10% in lossless implementations.
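
To make the notion of a motif with solid and wild characters concrete, the following is a minimal sketch of textual substitution with a single, hand-picked motif. It is not the procedure used in the paper: the motif is supplied by hand rather than discovered, and the wildcard symbol `.`, the token `#`, and all function names are assumptions introduced here for illustration. Decoding without the recorded wildcard fills gives a lossy reconstruction; decoding with them restores the input exactly, which loosely mirrors the idea of disambiguating a lossy encoding into a lossless one.

```python
WILD = "."  # assumed don't-care symbol; must not occur in the text itself


def matches_at(motif, text, i):
    """True if motif occurs at position i, with WILD matching any character."""
    if i + len(motif) > len(text):
        return False
    return all(m == WILD or m == c for m, c in zip(motif, text[i:i + len(motif)]))


def occurrences(motif, text):
    """All starting positions (possibly overlapping) where motif occurs."""
    return [i for i in range(len(text) - len(motif) + 1) if matches_at(motif, text, i)]


def encode(text, motif, token="#"):
    """Greedy left-to-right substitution of non-overlapping occurrences.

    Returns the encoded string and, per substituted occurrence, the characters
    that sat under the wildcards. Assumes token does not occur in text.
    """
    out, fills, i = [], [], 0
    while i < len(text):
        if matches_at(motif, text, i):
            out.append(token)
            fills.append("".join(c for m, c in zip(motif, text[i:]) if m == WILD))
            i += len(motif)
        else:
            out.append(text[i])
            i += 1
    return "".join(out), fills


def decode(encoded, motif, fills=None, token="#"):
    """Expand each token; without fills the wildcards stay unresolved (lossy)."""
    out, k = [], 0
    for ch in encoded:
        if ch != token:
            out.append(ch)
            continue
        if fills is None:
            out.append(motif)
        else:
            fill = iter(fills[k])
            out.append("".join(next(fill) if m == WILD else m for m in motif))
        k += 1
    return "".join(out)


text = "abcabdxabe"
encoded, fills = encode(text, "ab.")
lossy = decode(encoded, "ab.")
lossless = decode(encoded, "ab.", fills)
```

In this toy run the motif `ab.` covers three occurrences, so the lossy decoder only knows that some character followed each `ab`, while the lossless decoder uses the stored fills to recover the original string.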
