On the Entropy of DNA: Algorithms and Measurements Based on Memory and Rapid Convergence

We have applied the information-theoretic notion of entropy to characterize DNA sequences. We consider a genetic sequence signal that is too small for asymptotic entropy estimates to be accurate, and for which similar approaches have previously failed. We prove that the match length entropy estimator has a relatively fast convergence rate and demonstrate experimentally that, by using this entropy estimator, we can indeed extract a meaningful signal from segments of DNA. Further, we derive a method for detecting certain signals within DNA, known as splice junctions, with significantly better performance than previously known methods. The main result of this paper is that the entropy of genetic material which is ultimately expressed in protein sequences is higher than that of the material which is discarded. This is an unexpected result, since current biological theory holds that the discarded sequences ("introns") are capable of tolerating random changes to a greater degree than the retained sequences ("exons").

† farach@cs.rutgers.edu; Supported by DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), a National Science Foundation Science and Technology Center under NSF contract STC-8809648.
‡ noordewi@cs.rutgers.edu
§ ayse@mit.edu
¶ las@research.att.com
‖ ajw@playfair.stanford.edu
jz@ee.technion.ac.il
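
To illustrate the kind of estimator the abstract refers to, the following is a minimal Python sketch of one common sliding-window match-length estimate. It assumes the form log2(N_w) divided by the mean of (longest match length + 1), where N_w is the window size. The function names `longest_match` and `match_length_entropy`, the `window` parameter, and the +1 convention are assumptions made for this sketch; they are not taken from the paper, whose exact definition may differ.

```python
import math


def longest_match(seq, i, window):
    """Length of the longest prefix of seq[i:] that also occurs starting
    at some position in the preceding window seq[i-window:i].
    The match may run past position i (overlap is allowed)."""
    best = 0
    for j in range(max(0, i - window), i):
        k = 0
        while i + k < len(seq) and seq[j + k] == seq[i + k]:
            k += 1
        best = max(best, k)
    return best


def match_length_entropy(seq, window):
    """Hypothetical match-length entropy estimate in bits per symbol:
    log2(window) divided by the average of (longest match + 1),
    taken over every position that has a full window behind it."""
    if window < 2 or len(seq) <= window:
        raise ValueError("need window >= 2 and len(seq) > window")
    lengths = [longest_match(seq, i, window) + 1
               for i in range(window, len(seq))]
    return math.log2(window) / (sum(lengths) / len(lengths))


if __name__ == "__main__":
    # A strictly periodic sequence is highly predictable, so matches are
    # long and the estimate is close to zero.
    print(match_length_entropy("ACGT" * 50, 16))
```

Calling `match_length_entropy(segment, 16)` on a DNA string returns an estimate in bits per base. In this sketch the value is best used to compare segments with one another rather than read as an absolute entropy, since the window is short.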