Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity

BackgroundLarge sequence datasets are difficult to visualize and handle. Additionally, they often do not represent a random subset of the natural diversity, but the result of uncoordinated and convenience sampling. Consequently, they can suffer from redundancy and sampling biases.ResultsHere we present Treemmer, a simple tool to evaluate the redundancy of phylogenetic trees and reduce their complexity by eliminating leaves that contribute the least to the tree diversity.ConclusionsTreemmer can reduce the size of datasets with different phylogenetic structures and levels of redundancy while maintaining a sub-sample that is representative of the original diversity. Additionally, it is possible to fine-tune the behavior of Treemmer including any kind of meta-information, making Treemmer particularly useful for empirical studies.

[1]  Heng Li,et al.  A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data , 2011, Bioinform..

[2]  P. Bork,et al.  ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data , 2016, Molecular biology and evolution.

[3]  J. Archibald,et al.  Treetrimmer: a method for phylogenetic dataset size reduction , 2013, BMC Research Notes.

[4]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[5]  Andrew Rambaut,et al.  Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen) , 2016, Virus evolution.

[6]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[7]  Anne E. Magurran,et al.  Biological Diversity: Frontiers in Measurement and Assessment , 2011 .

[8]  Björn Usadel,et al.  Trimmomatic: a flexible trimmer for Illumina sequence data , 2014, Bioinform..

[9]  Catherine Macken,et al.  Tree pruner: An efficient tool for selecting data from a biased genetic database , 2011, BMC Bioinformatics.

[10]  Trevor Bedford,et al.  nextflu: real-time tracking of seasonal influenza virus evolution in humans , 2015, Bioinform..

[11]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[12]  Paramvir S. Dehal,et al.  FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments , 2010, PloS one.

[13]  K. Holt,et al.  Out-of-Africa migration and Neolithic co-expansion of Mycobacterium tuberculosis with modern humans , 2013, Nature Genetics.

[14]  Christopher A. Miller,et al.  VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. , 2012, Genome research.

[15]  Daniel H. Huson,et al.  Dendroscope: An interactive viewer for large phylogenetic trees , 2007, BMC Bioinformatics.

[16]  Oliviero Carugo,et al.  Protein sequence redundancy reduction: comparison of various method , 2010, Bioinformation.