DNA Sequence Classification Using Compression-Based Induction

Inductive learning methods, such as neural networks and decision trees, have become a popular approach to developing DNA sequence identification tools. Such methods attempt to form models of a collection of training data that can be used to predict future data accurately. The common approach to applying such methods to DNA sequence identification problems forms models that depend on the absolute locations of nucleotides and assume independence of consecutive nucleotide locations. This paper describes a new class of learning methods, called compression-based induction (CBI), that is geared towards sequence learning problems such as those that arise when learning DNA sequences. The central idea is to use text compression techniques on DNA sequences as the means for generalizing from sample sequences. The resulting methods form models that are based on the more important relative locations of nucleotides and on the dependence of consecutive locations. They also provide a suitable framework for injecting biological domain knowledge into the learning process. We present initial explorations of a range of CBI methods that demonstrate the potential of our methods for DNA sequence identification tasks.
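
To make the central idea concrete, the following is a minimal sketch of one way compression can drive classification. It is not the CBI methods explored in the paper: it assumes zlib (DEFLATE) as a stand-in compressor, and the function names `compressed_size`, `extra_cost`, and `classify` are hypothetical. A query sequence is assigned to the class whose training sequences let it be compressed most cheaply, so shared substrings count wherever they occur rather than at fixed positions.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the text (assumed stand-in compressor)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def extra_cost(training: list[str], query: str) -> int:
    """Extra bytes needed to encode the query after the class's training data.

    A small value means the query reuses substrings already seen in the
    training sequences, regardless of their absolute positions.
    """
    base = "".join(training)
    return compressed_size(base + query) - compressed_size(base)


def classify(query: str, classes: dict[str, list[str]]) -> str:
    """Return the label whose training data compresses the query best."""
    return min(classes, key=lambda label: extra_cost(classes[label], query))
```

A call such as `classify(query, {"promoter": promoter_seqs, "other": other_seqs})` returns the label with the smallest extra cost. A general-purpose byte compressor is a crude model of DNA, so this only illustrates the principle; a compressor designed for a four-letter alphabet would be the natural place to add domain knowledge.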
