A Monte Carlo approach successfully identifies randomness in multiple sequence alignments: a more objective means of data exclusion.

Random similarity between sequences or sequence sections can impede phylogenetic analyses and the identification of gene homologies. In addition, randomly similar sequences or ambiguously aligned sequence sections can interfere with the estimation of substitution model parameters. Phylogenomic studies have shown that biases in model estimation and tree reconstruction do not disappear even with large data sets; in fact, they can become more pronounced as data are added. It is therefore important to identify possible random similarity within sequence alignments before model estimation and tree reconstruction. Several approaches have already been suggested to identify and treat problematic alignment sections. We propose an alternative method that identifies random similarity within multiple sequence alignments (MSAs) based on Monte Carlo resampling within a sliding window. The method infers similarity profiles from pairwise sequence comparisons and then calculates a consensus profile that summarizes all single similarity profiles. Consensus profiles consequently identify dominating patterns of nonrandom similarity or randomness within sections of MSAs. We show that the approach clearly identifies randomness in simulated and real data. After the exclusion of putatively random sections, node support improves drastically in tree reconstructions for both the simulated and the real data. The method thus appears to be a powerful tool for identifying possible biases in tree reconstruction or gene identification. It is currently restricted to nucleotide data but will be extended to protein data in the near future.
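The general idea of sliding-window Monte Carlo resampling with pairwise profiles and a consensus profile can be illustrated with a minimal Python sketch. It is not the published algorithm: the scoring scheme (+1 per match, -1 per mismatch), the null model (random windows drawn from the base composition of the sequence pair), the 0.95 quantile cutoff, and all function and parameter names (`window_score`, `pair_profile`, `consensus_profile`, `window`, `n_resamples`, `quantile`) are assumptions made for illustration only.

```python
import random
from itertools import combinations


def window_score(a, b):
    """Matches minus mismatches between two equal-length windows (assumed scoring)."""
    return sum(1 if x == y else -1 for x, y in zip(a, b))


def pair_profile(seq_a, seq_b, window=6, n_resamples=200, quantile=0.95, rng=None):
    """Return +1 (nonrandom) or -1 (random-like) for each window position of one pair."""
    rng = rng or random.Random(0)
    # Assumed null model: bases drawn from the pooled composition of the pair.
    pool = list(seq_a + seq_b)
    null = sorted(
        window_score(rng.choices(pool, k=window), rng.choices(pool, k=window))
        for _ in range(n_resamples)
    )
    # Score that an observed window must exceed to count as nonrandom.
    threshold = null[int(quantile * (n_resamples - 1))]
    profile = []
    for i in range(len(seq_a) - window + 1):
        observed = window_score(seq_a[i:i + window], seq_b[i:i + window])
        profile.append(1 if observed > threshold else -1)
    return profile


def consensus_profile(alignment, **kwargs):
    """Average the pairwise profiles position by position; values lie in [-1, 1]."""
    profiles = [pair_profile(a, b, **kwargs) for a, b in combinations(alignment, 2)]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


if __name__ == "__main__":
    # Toy alignment: a conserved block followed by a fully divergent block.
    toy = [
        "ACGTACGTACGT" + "AAAAAAAAAAAA",
        "ACGTACGTACGT" + "CCCCCCCCCCCC",
        "ACGTACGTACGT" + "GGGGGGGGGGGG",
    ]
    for position, value in enumerate(consensus_profile(toy, window=6)):
        print(position, round(value, 2))
```

In this sketch, window positions with a consensus value near 1 are dominated by nonrandom similarity, while values near -1 mark sections that would be candidates for exclusion. The toy alignment is hypothetical and serves only to show the shape of the output.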

[1]  M S Waterman,et al.  Sequence alignment and penalty choice. Review of concepts, case studies and implications. , 1994, Journal of molecular biology.

[2]  Y. Inagaki,et al.  Phylogenetic estimation under codon models can be biased by codon usage heterogeneity. , 2006, Molecular phylogenetics and evolution.

[3]  Tero Aittokallio,et al.  A statistical score for assessing the quality of multiple sequence alignments , 2006, BMC Bioinformatics.

[4]  H. Philippe,et al.  Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. , 1999, Molecular biology and evolution.

[5]  Erik L. L. Sonnhammer,et al.  Kalign – an accurate and fast multiple sequence alignment algorithm , 2005, BMC Bioinformatics.

[6]  Dan Graur,et al.  Heads or tails: a simple reliability check for multiple sequence alignments. , 2007, Molecular biology and evolution.

[7]  Durbin,et al.  Biological Sequence Analysis , 1998 .

[8]  A. Dress,et al.  Split decomposition: a new and useful approach to phylogenetic analysis of distance data. , 1992, Molecular phylogenetics and evolution.

[9]  Olivier Poch,et al.  RASCAL: Rapid Scanning and Correction of Multiple Sequence Alignments , 2003, Bioinform..

[10]  Ari Löytynoja,et al.  SOAP, cleaning multiple alignments from unstable blocks , 2001, Bioinform..

[11]  Michael P. Cummings,et al.  PAUP* [Phylogenetic Analysis Using Parsimony (and Other Methods)] , 2004 .

[12]  J. Hein,et al.  Combining many multiple alignments in one improved alignment , 1999, Bioinform..

[13]  A. Dress,et al.  Multiple DNA and protein sequence alignment based on segment-to-segment comparison. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Gajendra P. S. Raghava,et al.  OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy , 2003, BMC Bioinformatics.

[15]  Erik L L Sonnhammer,et al.  Quality assessment of multiple alignment programs , 2002, FEBS letters.

[16]  Joseph J Gillespie Characterizing regions of ambiguous alignment caused by the expansion and contraction of hairpin-stem loops in ribosomal RNA molecules. , 2004, Molecular phylogenetics and evolution.

[17]  W C Wheeler,et al.  Elision: a method for accommodating multiple molecular sequence alignments with alignment-ambiguous sites. , 1995, Molecular phylogenetics and evolution.

[18]  Erik L. L. Sonnhammer,et al.  Automatic assessment of alignment quality , 2005, Nucleic acids research.

[19]  K. Kjer,et al.  Phylogeny of Trichoptera (caddisflies): characterization of signal and noise within multiple datasets. , 2001, Systematic biology.

[20]  Sean R. Eddy,et al.  Biological sequence analysis: Contents , 1998 .

[21]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[22]  A. Phillips,et al.  Multiple sequence alignment in phylogenetic analysis. , 2000, Molecular phylogenetics and evolution.

[23]  Martin Tompa,et al.  Statistics of local multiple alignments , 2005, ISMB.

[24]  W. Wheeler,et al.  The position of arthropods in the animal kingdom: Ecdysozoa, islands, trees, and the "Parsimony ratchet". , 1999, Molecular phylogenetics and evolution.

[25]  Naiara Rodríguez-Ezpeleta,et al.  Detecting and overcoming systematic errors in genome-scale phylogenies. , 2007, Systematic biology.

[26]  D A Morrison,et al.  Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. , 1997, Molecular biology and evolution.

[27]  P. Wagner,et al.  Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. , 2000, Systematic biology.

[28]  J. G. Burleigh,et al.  Phylogenetic signal in nucleotide data from seed plants: implications for resolving the seed plant tree of life. , 2004, American journal of botany.

[29]  M. Suchard,et al.  Joint Bayesian estimation of alignment and phylogeny. , 2005, Systematic biology.

[30]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[31]  Wei Qian,et al.  Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. , 2000, Molecular biology and evolution.

[32]  Andrew J. Roger,et al.  Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments , 2005, Journal of Molecular Evolution.

[33]  D. Huson,et al.  Application of phylogenetic networks in evolutionary studies. , 2006, Molecular biology and evolution.

[34]  Christoph Mayer,et al.  Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects , 2007, BMC Evolutionary Biology.

[35]  R DeSalle,et al.  Alignment-ambiguous nucleotide sites and the exclusion of systematic data. , 1993, Molecular phylogenetics and evolution.

[36]  J. Shultz,et al.  Ecdysozoan phylogeny and Bayesian inference: first use of nearly complete 28S and 18S rRNA gene sequences to classify the arthropods and their kin. , 2004, Molecular phylogenetics and evolution.

[37]  John P. Huelsenbeck,et al.  MRBAYES: Bayesian inference of phylogenetic trees , 2001, Bioinform..

[38]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[39]  K. Kjer,et al.  Opinions on multiple sequence alignment, and an empirical comparison of repeatability and accuracy between POY and structural alignment. , 2007, Systematic biology.

[40]  Benjamin D. Redelings,et al.  BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny , 2006, Bioinform..

[41]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[42]  F. Delsuc,et al.  Phylogenomics: the beginning of incongruence? , 2006, Trends in genetics : TIG.

[43]  Gerard Talavera,et al.  Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. , 2007, Systematic biology.

[44]  S. Beverley,et al.  Evolution of nuclear ribosomal RNAs in kinetoplastid protozoa: perspectives on the age and origins of parasitism. , 1993, Proceedings of the National Academy of Sciences of the United States of America.

[45]  Qingxin Zhu,et al.  [Recent progress in multiple sequence alignment]. , 2010, Sheng wu yi xue gong cheng xue za zhi = Journal of biomedical engineering = Shengwu yixue gongchengxue zazhi.

[46]  E. Herniou,et al.  Acoel flatworms: earliest extant bilaterian Metazoans, not members of Platyhelminthes. , 1999, Science.