A new fast method for inferring multiple consensus trees using k-medoids

Background

Gene trees carry important information about specific evolutionary patterns which characterize the evolution of the corresponding gene families. However, a reliable species consensus tree cannot be inferred from a multiple sequence alignment of a single gene family or from the concatenation of alignments corresponding to gene families having different evolutionary histories. These evolutionary histories can be quite different due to horizontal transfer events or to ancient gene duplications which cause the emergence of paralogs within a genome. Many methods have been proposed to infer a single consensus tree from a collection of gene trees. Still, the application of these tree merging methods can lead to the loss of specific evolutionary patterns which characterize some gene families or some groups of gene families. Thus, the problem of inferring multiple consensus trees from a given set of gene trees becomes relevant.

Results

We describe a new fast method for inferring multiple consensus trees from a given set of phylogenetic trees (i.e. additive trees or X-trees) defined on the same set of species (i.e. objects or taxa). The traditional consensus approach yields a single consensus tree. We use the popular k-medoids partitioning algorithm to divide a given set of trees into several clusters of trees. We propose novel versions of the well-known Silhouette and Caliński-Harabasz cluster validity indices that are adapted for tree clustering with k-medoids. The efficiency of the new method was assessed using both synthetic and real data, such as a well-known phylogenetic dataset consisting of 47 gene trees inferred for 14 archaeal organisms.

Conclusions

The method described here allows inference of multiple consensus trees from a given set of gene trees. It can be used to identify groups of gene trees having similar intragroup and different intergroup evolutionary histories. The main advantage of our method is that it is much faster than the existing tree clustering approaches, while providing similar or better clustering results in most cases. This makes it particularly well suited for the analysis of large genomic and phylogenetic datasets.
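
To make the clustering idea concrete, the following is a minimal sketch of k-medoids applied to a set of trees, not the authors' implementation. It rests on several assumptions that the abstract does not state: each tree is represented as a set of clade labels, the dissimilarity between two trees is the size of the symmetric difference of these sets (a Robinson-Foulds-style distance), the medoids are initialized deterministically with the first k trees, and the number of clusters is chosen with the standard Silhouette index rather than the adapted versions proposed in the paper. The function names `tree_distance`, `k_medoids`, `mean_silhouette` and `choose_k`, as well as the toy trees, are hypothetical.

```python
def tree_distance(a, b):
    """Assumed dissimilarity: number of clades present in exactly one tree."""
    return len(a ^ b)


def assign(dist, medoids):
    """Assign every tree to the index of its closest medoid (ties: first medoid)."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix."""
    medoids = list(range(k))  # assumption: deterministic start from the first k trees
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster became empty
                new_medoids.append(medoids[c])
                continue
            # the medoid is the member minimizing total distance to its cluster
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


def mean_silhouette(dist, labels):
    """Standard Silhouette index averaged over all trees (singletons score 0)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        others = [m for c, m in clusters.items() if c != lab]
        if not own or not others:
            continue  # contributes 0
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in m) / len(m) for m in others)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(dist, k_values):
    """Return the k whose partition has the highest mean Silhouette value."""
    return max(k_values, key=lambda k: mean_silhouette(dist, k_medoids(dist, k)[1]))


# Toy data: six trees over taxa A..F, written as sets of clade labels.
trees = [
    {"AB", "ABC", "EF"}, {"AB", "ABC", "DE"}, {"AB", "ABC", "DF"},
    {"AD", "ADE", "BC"}, {"AD", "ADE", "BF"}, {"AD", "ADE", "CF"},
]
dist = [[tree_distance(s, t) for t in trees] for s in trees]

medoids, labels = k_medoids(dist, 2)
best_k = choose_k(dist, (2, 3))
print(labels, round(mean_silhouette(dist, labels), 3), best_k)
```

In this toy example the first three trees share two clades and the last three share two different clades, so a two-cluster partition separates them. A consensus tree (for instance a majority-rule consensus) could then be computed separately for each cluster, which is the sense in which the approach yields multiple consensus trees.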
