Heuristic algorithms for best match graph editing

Background: Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever y is one of the phylogenetically closest relatives of x among the genes of Y. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates will therefore usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data.

Results: Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem. This connection leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm. They give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing.

Conclusion: Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.
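
To make the best match definition from the Background concrete, the following minimal Python sketch derives the arcs of a BMG from a known leaf-colored rooted tree. It is an illustration only, not one of the editing heuristics discussed here. The function name `best_match_graph`, the `parent`/`color` dictionary encoding, and the toy tree are all assumptions made for this example. "Phylogenetically closest" is read as: the last common ancestor of x and y is at least as deep as that of x and any other gene of the same species as y.

```python
def best_match_graph(parent, color):
    """Return the arc set of the best match graph of a leaf-colored tree.

    parent: dict mapping every vertex to its parent (the root maps to None).
    color:  dict mapping every leaf (gene) to its species.
    Both encodings are hypothetical choices for this sketch.
    """
    def ancestors(v):
        # Path from v up to the root, v itself included.
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path

    depth = {v: len(ancestors(v)) - 1 for v in parent}
    arcs = set()
    for x in color:
        on_path = set(ancestors(x))
        # Depth of lca(x, y) for every gene y from a different species.
        lca_depth = {}
        for y in color:
            if color[y] == color[x]:
                continue
            lca = next(a for a in ancestors(y) if a in on_path)
            lca_depth[y] = depth[lca]
        # Within each other species, keep the genes with the deepest lca.
        for species in {color[y] for y in lca_depth}:
            candidates = {y: d for y, d in lca_depth.items() if color[y] == species}
            best = max(candidates.values())
            arcs.update((x, y) for y, d in candidates.items() if d == best)
    return arcs


# Toy tree ((a1, b1)u, b2)r with species A = {a1} and B = {b1, b2}.
parent = {"r": None, "u": "r", "b2": "r", "a1": "u", "b1": "u"}
color = {"a1": "A", "b1": "B", "b2": "B"}
print(sorted(best_match_graph(parent, color)))
```

In this toy tree, a1 is a best match of b2, but b2 is not a best match of a1 because b1 is closer to a1. The example shows that best matches are not symmetric in general, which is why BMGs are digraphs.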
