Discovering dependencies via algorithmic mutual information: A case study in DNA sequence comparisons

Algorithmic mutual information is a central concept in algorithmic information theory and may be measured as the difference between the independent and joint minimal encoding lengths of objects; it is also a central concept in Chaitin's fascinating mathematical definition of life. We explore the applicability of algorithmic mutual information as a tool for discovering dependencies in biology. In order to determine the significance of discovered dependencies, we extend the newly proposed algorithmic significance method. The main theorem of the extended method states that d bits of algorithmic mutual information imply dependency at the significance level 2^(-d+O(1)). We apply a heuristic version of the method to one of the main problems in DNA and protein sequence comparisons: the problem of deciding whether observed similarity between sequences should be explained by their relatedness or by the mere presence of some shared internal structure, e.g., shared internal repetitive patterns. We take advantage of the fact that mutual information factors out sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. In addition to providing a general framework for sequence comparisons, we also propose an efficient way to compare sequences based on their subword composition that does not require any a priori assumptions about k-tuple length.
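
The encoding-length view of mutual information described above can be illustrated with a small, self-contained sketch. The example below uses a general-purpose compressor (zlib) as a crude stand-in for minimal encoding length and applies the 2^(-d) significance bound directly; the choice of compressor, the omission of the O(1) term, and the example sequences are assumptions made for illustration, not the encoding scheme or the subword-composition method proposed in the paper.

```python
# Heuristic sketch: compression-based approximation of algorithmic mutual
# information and the corresponding significance bound. zlib is an assumed
# stand-in for a minimal encoding; real encoding lengths are not computable.
import zlib


def encoding_length(data: bytes) -> int:
    """Compressed length in bits, used as a rough proxy for minimal encoding length."""
    return 8 * len(zlib.compress(data, 9))


def mutual_information_bits(a: bytes, b: bytes) -> float:
    """Independent encoding length minus joint encoding length, in bits."""
    independent = encoding_length(a) + encoding_length(b)
    joint = encoding_length(a + b)
    return independent - joint


def significance_level(d_bits: float) -> float:
    """Upper bound 2^(-d) on observing d bits of mutual information between
    independent sequences (the O(1) constant of the theorem is ignored here)."""
    return 2.0 ** (-d_bits)


if __name__ == "__main__":
    # Hypothetical toy sequences; a related pair shares most of its content,
    # so the joint encoding is much shorter than the two independent encodings.
    seq1 = b"ACGTACGTTAGCCGATAGGCTTACGATCGATCGGATCCTAG" * 4
    seq2 = b"ACGTACGTTAGCCGATAGGCTTACGGTCGATCGGATCCTAG" * 4
    unrelated = b"TTTTAAAACCCCGGGGTTAACCGGTTAACCGGAATTCCGGA" * 4

    d = mutual_information_bits(seq1, seq2)
    print(f"related pair:   d = {d:.1f} bits, significance <= {significance_level(d):.2e}")

    d = mutual_information_bits(seq1, unrelated)
    print(f"unrelated pair: d = {d:.1f} bits, significance <= {significance_level(d):.2e}")
```

Because the compressor exploits repeated material within a single sequence as well as between sequences, this style of measure discounts similarity that stems from shared internal repetitive structure, which is the intuition behind using mutual information rather than raw similarity scores.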
