Information Theoretic Distance Measures in Phylogenomics

Information theory has produced a variety of distance measures that have proven useful in digital information systems. Because the information for a living organism is stored digitally on the information carrier DNA, it seems natural to apply these methods to genome analysis. We present two applications to genetics. First, a compression-based distance measure is used to compute pairwise distances between genomic sequences of unequal length and thus to recognize the content of a DNA region. Second, the Kullback-Leibler distance serves as the basis for estimating evolutionary conservation across the genomes of different species, in order to identify regions of potentially important function. Moreover, we show that conclusions can be drawn about the biological properties of the sequences analyzed in this way.
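The two kinds of measure named above can be illustrated with a minimal sketch. It is not the implementation used in this work; it rests on the following assumptions: the compression-based distance is taken to be the normalized compression distance, with the general-purpose `zlib` compressor standing in for a dedicated DNA compressor; and the Kullback-Leibler distance is evaluated between the empirical nucleotide distribution of a single alignment column and a uniform background. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
import math
import random
import zlib
from collections import Counter


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence (a stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance; x and y may differ in length."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def column_distribution(column: str, alphabet: str = "ACGT",
                        pseudocount: float = 1.0) -> dict:
    """Empirical symbol distribution of one alignment column.

    The pseudocount (an assumption of this sketch) keeps every
    probability strictly positive.
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {s: (counts[s] + pseudocount) / total for s in alphabet}


def kl_divergence(p: dict, q: dict) -> float:
    """Kullback-Leibler distance D(p || q) in bits."""
    return sum(pi * math.log2(pi / q[s]) for s, pi in p.items() if pi > 0)


# Toy data: two independent random sequences of unequal length.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(2000))
seq_b = "".join(rng.choice("ACGT") for _ in range(3000))

d_same = ncd(seq_a, seq_a)   # a sequence compared with itself
d_diff = ncd(seq_a, seq_b)   # two unrelated sequences

# Toy alignment columns scored against a uniform background.
background = {s: 0.25 for s in "ACGT"}
kl_conserved = kl_divergence(column_distribution("AAAAAAAA"), background)
kl_mixed = kl_divergence(column_distribution("ACGTACGT"), background)
```

In this sketch a sequence compared with itself yields a smaller distance than two unrelated sequences, and a perfectly conserved column diverges further from the background than an evenly mixed one. A compressor designed for DNA and a background model estimated from neutrally evolving sequence would be the more realistic choices.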
