Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses?

Alignment is a crucial issue in molecular phylogenetics because different alignment methods can yield very different topologies for individual genes. However, it is unclear whether the choice of alignment method remains important in phylogenomic analyses, which incorporate data from dozens, hundreds, or thousands of genes. On the one hand, problematic biases in alignment might be multiplied across many loci; on the other hand, alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of the alignments generated by the different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these datasets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree (ASTRAL-III) methods, using the full data (∼5,000 loci) and subsampled datasets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic datasets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery of, and support values for, well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several "best practices" for alignment and trimming. Intriguingly, the choice of phylogenetic method impacted the results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
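The alignment properties compared above (length, informative sites, missing data) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study. The function name `alignment_stats`, the treatment of gaps and `N` as missing data, and the use of parsimony-informative sites (at least two character states, each present in at least two sequences) are assumptions made for illustration.

```python
from collections import Counter

# Assumption: gaps, '?', and 'N' are all counted as missing data.
MISSING = set("-?Nn")


def alignment_stats(seqs):
    """Summarize one aligned locus given as {taxon_name: aligned_sequence}."""
    rows = list(seqs.values())
    if not rows:
        raise ValueError("alignment is empty")
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned sequences must have the same length")

    informative = 0
    missing = 0
    for column in zip(*rows):
        # Count observed (non-missing) states in this column, ignoring case.
        counts = Counter(c.upper() for c in column if c not in MISSING)
        missing += len(column) - sum(counts.values())
        # Parsimony-informative: at least two states, each seen at least twice.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1

    total_cells = length * len(rows)
    return {
        "length": length,
        "informative_sites": informative,
        "missing_fraction": missing / total_cells if total_cells else 0.0,
    }


if __name__ == "__main__":
    # Toy four-taxon alignment (hypothetical data).
    toy = {"a": "ACGT-", "b": "ACGTA", "c": "ATGTA", "d": "ATG-A"}
    print(alignment_stats(toy))
```

Running a function like this on each locus before and after trimming would show how much alignment length and how many informative sites a given trimming method removes.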
