Cutting an alignment with Ockham's razor

In this article, we investigate parsimony-based approaches to finding recombination breakpoints in a multiple sequence alignment. Detecting recombination is crucial for avoiding errors in evolutionary analyses caused by mixing portions of sequences that have different evolutionary histories. Following an overview of the field of recombination detection, we formulate four computational problems for this task, each with a different objective function. The four problems aim to minimize (1) the total homoplasy of all blocks, (2) the maximum homoplasy per block, (3) the total homoplasy ratio of all blocks, and (4) the maximum homoplasy ratio per block. We describe an algorithm for each of these problems; the algorithms are fixed-parameter tractable (FPT) when the characters are binary. We have implemented the algorithms and tested them on simulated data, showing that minimizing the total homoplasy gives the most accurate results in most cases. Our implementation and experimental data are publicly available. Finally, we consider the problem of combining blocks into non-contiguous blocks, each consisting of at most p contiguous parts. Fixing the homoplasy h of each block to 0, we show that this problem is NP-hard when p >= 3 but polynomial-time solvable when p = 2. Furthermore, when p = 2 the problem is FPT with parameter h for binary characters. A number of interesting problems remain open.
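To make the notion of a block with homoplasy 0 concrete for binary characters, the following minimal sketch may help. It is not one of the four algorithms described above. It relies on the classical four-gamete condition: a set of binary characters fits a single tree without homoplasy exactly when no pair of columns exhibits all four combinations 00, 01, 10 and 11. The function names (`compatible`, `zero_homoplasy`, `greedy_cut`) and the greedy cutting rule are illustrative assumptions, not taken from the article or its implementation.

```python
from itertools import combinations


def compatible(col_a, col_b):
    """Four-gamete test for two binary columns.

    The columns are compatible (no homoplasy needed on some common tree)
    iff at most three of the pairs 00, 01, 10, 11 occur across the taxa.
    """
    return len(set(zip(col_a, col_b))) < 4


def zero_homoplasy(columns):
    """True iff every pair of binary columns passes the four-gamete test."""
    return all(compatible(a, b) for a, b in combinations(columns, 2))


def greedy_cut(alignment):
    """Cut a binary alignment into contiguous blocks of homoplasy 0.

    alignment: list of equal-length strings over '0'/'1', one per taxon
    (a hypothetical input format chosen for this sketch).
    Returns half-open column ranges (start, end). Each block is extended
    as far as possible before a breakpoint is placed.
    """
    if not alignment:
        return []
    columns = list(zip(*alignment))  # transpose rows (taxa) into columns
    blocks, start = [], 0
    for j in range(1, len(columns) + 1):
        at_end = j == len(columns)
        # Close the current block when the next column conflicts with it.
        if at_end or not all(compatible(c, columns[j]) for c in columns[start:j]):
            blocks.append((start, j))
            start = j
    return blocks


if __name__ == "__main__":
    toy = ["0000", "0011", "1001", "1111"]  # four taxa, four binary sites
    print(greedy_cut(toy))
```

Because every sub-block of a compatible block is itself compatible, extending each block as far as possible yields the fewest contiguous zero-homoplasy blocks. The objectives in the article instead allow nonzero homoplasy per block, which this sketch does not attempt to compute.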
