Mining, compressing and classifying with extensible motifs

BackgroundMotif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing, however, special classes of "rigid" motifs have been identified of which the discovery is affordable in low polynomial time.ResultsIn the present work, "extensible" motifs are considered such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction.ConclusionOff-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences.

[1]  Abhi Shelat,et al.  Approximation algorithms for grammar-based compression , 2002, SODA '02.

[2]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[3]  Craig G. Nevill-Manning,et al.  Compression by induction of hierarchical grammars , 1994, Proceedings of IEEE Data Compression Conference (DCC'94).

[4]  James A. Storer,et al.  Data Compression: Methods and Theory , 1987 .

[5]  James A. Storer,et al.  On-Line Versus Off-Line Computation in Dynamic Text Compression , 1996, Inf. Process. Lett..

[6]  Matteo Comin,et al.  Motifs in Ziv-Lempel-Welch Clef , 2004, Data Compression Conference, 2004. Proceedings. DCC 2004.

[7]  Paul M. B. Vitányi,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 1993, Graduate Texts in Computer Science.

[8]  William I. Gasarch,et al.  Book Review: An introduction to Kolmogorov Complexity and its Applications Second Edition, 1997 by Ming Li and Paul Vitanyi (Springer (Graduate Text Series)) , 1997, SIGACT News.

[9]  En-Hui Yang,et al.  Grammar-based codes: A new class of universal lossless source codes , 2000, IEEE Trans. Inf. Theory.

[10]  Abraham Lempel,et al.  On the Complexity of Finite Sequences , 1976, IEEE Trans. Inf. Theory.

[11]  A. Apostolico,et al.  Off-line compression by greedy textual substitution , 2000, Proceedings of the IEEE.

[12]  Ian H. Witten,et al.  Protein is incompressible , 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[13]  Xin Chen,et al.  An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[14]  Laxmi Parida,et al.  An inexact-suffix-tree-based algorithm for detecting extensible patterns , 2005, Theor. Comput. Sci..

[15]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.