Sequence Comparisons via Algorithmic Mutual Information

One of the main problems in DNA and protein sequence comparison is deciding whether the observed similarity of two sequences should be explained by their relatedness or by the mere presence of some shared internal structure, e.g., shared internal tandem repeats. Standard methods based on statistics or classical information theory can be used to discover either internal structure or mutual sequence similarity, but cannot take both into account. Consequently, currently used methods for sequence comparison employ "masking" techniques that simply eliminate sequences exhibiting internal repetitive structure prior to comparison. The "masking" approach precludes discovery of homologous sequences of moderate or low complexity, which abound at both the DNA and protein levels. As a solution to this problem, we propose a general method based on algorithmic information theory and minimal length encoding. We show that algorithmic mutual information factors out the sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. We extend the recently developed algorithmic significance method (Milosavljević & Jurka 1993) to show that significance depends exponentially on algorithmic mutual information.
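
The idea in the abstract can be illustrated with a rough sketch. The code below is not the encoding scheme of the paper: it assumes that a general-purpose compressor (zlib) is an acceptable crude stand-in for a minimal length encoding, and the function names (compressed_bits, mutual_information_bits, significance_bound, random_dna) are hypothetical. Mutual information is approximated as the number of bits saved when two sequences are compressed together rather than separately, and the significance bound uses the 2^(-d) form of the algorithmic significance method, where d is the number of bits saved relative to the null encoding.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data (a crude proxy for encoding length)."""
    return 8 * len(zlib.compress(data, 9))


def mutual_information_bits(s: bytes, t: bytes) -> int:
    """Approximate I(s;t) as C(s) + C(t) - C(st): bits saved by encoding s and t jointly."""
    return compressed_bits(s) + compressed_bits(t) - compressed_bits(s + t)


def significance_bound(d_bits: float) -> float:
    """Upper bound 2^(-d) on the chance of saving d bits under the null model."""
    return min(1.0, 2.0 ** (-d_bits))


def random_dna(n: int, seed: int) -> bytes:
    """Hypothetical helper: a pseudo-random DNA string of length n."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(n))


if __name__ == "__main__":
    complex_seq = random_dna(2000, seed=0)
    unrelated_seq = random_dna(2000, seed=1)
    repeat_seq = b"CA" * 1000

    # Two copies of a complex sequence share a great deal of information.
    print("related, complex:  ", mutual_information_bits(complex_seq, complex_seq))
    # Independent complex sequences share little.
    print("unrelated, complex:", mutual_information_bits(complex_seq, unrelated_seq))
    # Two identical tandem repeats look alike, but each is already cheap to
    # encode on its own, so little is saved by encoding them together.
    print("shared repeats:    ", mutual_information_bits(repeat_seq, repeat_seq))
```

In this sketch the pair of tandem repeats scores low even though the two sequences are identical, because the similarity is already explained by internal structure. This is the behavior the abstract attributes to algorithmic mutual information, shown here only approximately, within the limits of the zlib stand-in.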

[1] L. Allison, et al. Minimum message length encoding and the comparison of macromolecules, 1990, Bulletin of Mathematical Biology.

[2] Aleksandar Milosavljevic, et al. Discovering Sequence Similarity by the Algorithmic Significance Method, 1993, ISMB.

[3] S. Altschul, et al. Issues in searching molecular sequence databases, 1994, Nature Genetics.

[4] S. Karlin, et al. Chance and statistical significance in protein and DNA sequence analysis, 1992, Science.

[5] G. Chaitin, et al. Toward a Mathematical Definition of “Life”, 1979.

[6] James A. Storer, et al. Data Compression: Methods and Theory, 1987.

[7] David Haussler, et al. The Smallest Automaton Recognizing the Subwords of a Text, 1985, Theor. Comput. Sci.

[8] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[9] I. Good, et al. The Maximum Entropy Formalism, 1979.

[10] John C. Wootton, et al. Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases, 1993, Comput. Chem.

[11] E. Chen, et al. The human growth hormone locus: nucleotide sequence, biology, and evolution, 1989, Genomics.

[12] Jean-Michel Claverie, et al. Information Enhancement Methods for Large Scale Sequence Analysis, 1993, Comput. Chem.

[13] Lloyd Allison, et al. Reconstruction of strings past, 1993, Comput. Appl. Biosci.

[14] Ming Li, et al. An Introduction to Kolmogorov Complexity and Its Applications, 2019, Texts in Computer Science.

[15] Aleksandar Milosavljevic, et al. Discovering simple DNA sequences by the algorithmic significance method, 1993, Comput. Appl. Biosci.
