Detecting Sporadic Recombination in DNA Alignments with Hidden Markov Models

Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple DNA sequence alignments. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global frequency of recombination. All model parameters are optimized in a maximum likelihood sense with the expectation maximization (EM) algorithm. The resulting parameter optimization scheme is applied to a synthetic benchmark problem and to real DNA sequences from the argF gene of four strains of the bacterium Neisseria. In both cases we find a significant improvement over an earlier heuristic parameter estimation approach.
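To make the model structure concrete, the following is a minimal sketch of such an HMM in Python, not the implementation used in this work. It rests on several assumptions that are not stated above: the hidden states are the three unrooted topologies possible for four sequences; the transition matrix is governed by a single hypothetical parameter nu (the probability of keeping the same topology from one site to the next), with switches spread uniformly over the other topologies; and the per-site emission probabilities are passed in as a precomputed array, with placeholder numbers standing in for the tree likelihoods that the topology and branch lengths would determine. All function names are illustrative. Only the EM update for nu is shown.

```python
import numpy as np

def transition_matrix(n_states, nu):
    """Assumed form: keep the topology with probability nu, else switch uniformly."""
    off = (1.0 - nu) / (n_states - 1)
    A = np.full((n_states, n_states), off)
    np.fill_diagonal(A, nu)
    return A

def forward_backward(emissions, A, prior):
    """Scaled forward-backward pass; emissions[t, k] = P(column t | topology k)."""
    T, K = emissions.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)
    alpha[0] = prior * emissions[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * emissions[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (emissions[t + 1] * beta[t + 1]) / scale[t + 1]
    return alpha, beta, scale

def em_update_nu(emissions, A, alpha, beta, scale):
    """M-step for nu: expected fraction of site-to-site transitions that keep the topology."""
    T = emissions.shape[0]
    stay = 0.0
    for t in range(T - 1):
        # Posterior probability that sites t and t+1 share the same topology.
        stay += np.sum(alpha[t] * np.diag(A) * emissions[t + 1] * beta[t + 1]) / scale[t + 1]
    return stay / (T - 1)

# Placeholder emissions (NOT real tree likelihoods): the first three columns
# favour topology 0 and the last three favour topology 2.
emissions = np.array([[0.8, 0.1, 0.1]] * 3 + [[0.1, 0.1, 0.8]] * 3)
prior = np.full(3, 1.0 / 3.0)
A = transition_matrix(3, nu=0.9)

alpha, beta, scale = forward_backward(emissions, A, prior)
posterior = alpha * beta            # P(topology at site t | whole alignment)
log_lik = np.log(scale).sum()       # log-likelihood of the alignment
nu_new = em_update_nu(emissions, A, alpha, beta, scale)
print(posterior.argmax(axis=1), log_lik, nu_new)
```

A change in the most probable topology along the alignment marks a candidate recombination breakpoint. In a full treatment the branch lengths entering the emission probabilities would be re-estimated in the same EM loop; they are held fixed here for brevity.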