Inferring evolution in bacteria using Markov chains and genomic signatures

This thesis concerns the development of methods and models in evolutionary molecular biology. The techniques are also applicable to other, similar biological problems. The first contribution is a novel classifier, based on fixed and variable length Markov chains, that can discriminate between bacterial DNA from different species. The classifier assumes that the composition of oligomers (DNA words) is species-specific and represents global features of the species, a so-called genomic signature. The direct applications of such a classifier are the identification of horizontal gene transfer and the binning of metagenomic data. The former has been the primary focus, as it is one of the central processes in the evolution of bacteria. We suggest a new method for locking the number of parameters in a variable length Markov model and propose a method for rejecting false candidates of horizontal gene transfer events. The second contribution is a novel estimator for finding the prediction suffix tree of a variable length Markov chain. This new estimator is highly efficient in finding the correct state space, and we show that it compares favorably to a popular estimator in terms of predictive likelihood. The third contribution is to the analysis of gene order rearrangements in bacteria. We recapitulate previous results on expected distances and derive new ones for cases that have recently gained support in the literature, such as symmetrical and short reversals. We also describe new categories of gene order patterns and show how these can be explained with models using short, symmetric and uniformly distributed transpositions and reversals. The fourth contribution is a part of the Greengenes project, which is a chimera-free database of 16S rDNA sequences.
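
To illustrate the idea behind the first contribution, the following is a minimal sketch of a genomic-signature classifier built on a fixed-order Markov chain. It is not the classifier developed in the thesis: the order, the pseudocount smoothing, the function names and the toy training strings are all assumptions made for illustration, and variable length Markov chains are not covered. Each species model estimates how often a nucleotide follows a given oligomer, and a fragment is assigned to the species whose model gives it the highest log-likelihood.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_chain(sequence, order=2, pseudocount=1.0):
    """Return a function giving smoothed log P(symbol | context).

    Counts every (context, next symbol) pair in `sequence`, where the
    context is the preceding `order` nucleotides. The pseudocount is an
    illustrative smoothing choice, not taken from the thesis.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(sequence)):
        counts[sequence[i - order:i]][sequence[i]] += 1.0

    def log_prob(context, symbol):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(symbol, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(log_prob, sequence, order=2):
    """Sum the transition log-probabilities along `sequence`."""
    return sum(
        log_prob(sequence[i - order:i], sequence[i])
        for i in range(order, len(sequence))
    )


def classify(fragment, models, order=2):
    """Pick the model name under which `fragment` is most likely."""
    return max(
        models,
        key=lambda name: log_likelihood(models[name], fragment, order),
    )


# Toy "genomes": synthetic repeats, not real bacterial sequences.
models = {
    "species_a": train_markov_chain("ACGT" * 50),
    "species_b": train_markov_chain("AATT" * 50),
}

label_a = classify("ACGTACGTACGT", models)
label_b = classify("AATTAATTAATT", models)
```

Under these assumptions, a fragment whose best score comes from a species other than its host genome would be a candidate for horizontal gene transfer, and the same scoring could be used to bin metagenomic fragments by species.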