Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments

Abstract Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis despite extensive evidence that MSA accuracy and uncertainty affect results. These errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used to either filter characters from the MSA (partial filtering) or represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding Divvier substantially outperforms existing filtering software by retaining more true pairwise homologies calls and removing more false positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty.

[1]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[2]  J. Huelsenbeck Is the Felsenstein zone a fly trap? , 1997, Systematic biology.

[3]  O Gascuel,et al.  BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. , 1997, Molecular biology and evolution.

[4]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[5]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[6]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[7]  Shane S. Sturrock,et al.  Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison . David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95). , 2000 .

[8]  Wei Qian,et al.  Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. , 2000, Molecular biology and evolution.

[9]  Louis J. Billera,et al.  Geometry of the Space of Phylogenetic Trees , 2001, Adv. Appl. Math..

[10]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[11]  Lior Pachter,et al.  MAVID: constrained ancestral alignment of multiple sequences. , 2003, Genome research.

[12]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[13]  István Miklós,et al.  Statistical Alignment: Recent Progress, New Applications, and Challenges , 2005 .

[14]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[15]  Simon Whelan,et al.  Statistical Methods in Molecular Evolution , 2005 .

[16]  M. Rosenberg,et al.  Multiple sequence alignment accuracy and phylogenetic inference. , 2006, Systematic biology.

[17]  Dan Graur,et al.  Heads or tails: a simple reliability check for multiple sequence alignments. , 2007, Molecular biology and evolution.

[18]  M. Suchard,et al.  Alignment Uncertainty and Genomic Analysis , 2008, Science.

[19]  A. Löytynoja,et al.  Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis , 2008, Science.

[20]  Toni Gabaldón,et al.  trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses , 2009, Bioinform..

[21]  Ziheng Yang,et al.  INDELible: A Flexible Simulator of Biological Sequence Evolution , 2009, Molecular biology and evolution.

[22]  Tal Pupko,et al.  An alignment confidence score capturing robustness to guide tree uncertainty. , 2010, Molecular biology and evolution.

[23]  Jeet Sukumaran,et al.  DendroPy: a Python library for phylogenetic computing , 2010, Bioinform..

[24]  J. Scott Provan,et al.  A Fast Algorithm for Computing Geodesic Distances in Tree Space , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[25]  Jonathan A. Eisen,et al.  Accounting For Alignment Uncertainty in Phylogenomics , 2012, PloS one.

[26]  Nick Goldman,et al.  The effects of alignment error and alignment filtering on the sitewise detection of positive selection. , 2012, Molecular biology and evolution.

[27]  K. Katoh,et al.  MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability , 2013, Molecular biology and evolution.

[28]  Simon Whelan,et al.  Class of multiple sequence alignment algorithm affects genomic analysis. , 2013, Molecular biology and evolution.

[29]  Alexandros Stamatakis,et al.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies , 2014, Bioinform..

[30]  Jian Ma,et al.  PSAR-Align: improving multiple sequence alignment using probabilistic sampling , 2014, Bioinform..

[31]  Ziheng Yang,et al.  Molecular Evolution: A Statistical Approach , 2014 .

[32]  Md. Shamsuzzoha Bayzid,et al.  Whole-genome analyses resolve early branches in the tree of life of modern birds , 2014, Science.

[33]  Tal Pupko,et al.  GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters , 2015, Nucleic Acids Res..

[34]  Matthieu Muffato,et al.  Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference , 2015, Systematic biology.

[35]  Benjamin P. Blackburne,et al.  Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty , 2015, Genome biology and evolution.

[36]  David A. Morrison,et al.  Molecular homology and multiple-sequence alignment: an analysis of concepts and practice , 2015, Australian Systematic Botany.

[37]  Cédric Notredame,et al.  Multiple sequence alignment modeling: methods and applications , 2016, Briefings Bioinform..

[38]  S. Whelan,et al.  Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking , 2016, Systematic biology.

[39]  S. Whelan,et al.  Inferring Trees. , 2017, Methods in molecular biology.

[40]  Marcin Bogusz,et al.  Evolutionary Approaches to Sequence Alignment , 2018 .