Heads or tails: a simple reliability check for multiple sequence alignments.

The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real-life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based on the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an alignment that is the exact reverse of the one obtained from the unreversed sequences. Such "ideal" alignments, however, are the exception in real-life settings, and the two alignments, which we term the heads and tails alignments, usually differ to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment-dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BAliBASE alignment database. Surprisingly, neighbor-joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.
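
The heads-or-tails comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`hot_alignments`, `column_score`, `columns`) are hypothetical, the aligner is assumed to be any user-supplied callable that maps a list of sequences to a list of equal-length gapped rows, and the column-identity fraction is only one possible way to quantify agreement between the two alignments. The toy aligner `pad_right` merely pads sequences with trailing gaps; it stands in for a real alignment program so that the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

Aligner = Callable[[List[str]], List[str]]   # assumed interface, not a real library API
Column = Tuple[Optional[int], ...]


def columns(alignment: List[str], gap: str = "-") -> List[Column]:
    """Encode each column as the residue index it holds per sequence (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == gap:
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def hot_alignments(seqs: List[str], align: Aligner) -> Tuple[List[str], List[str]]:
    """Return the heads alignment and the tails alignment (flipped back to forward order)."""
    heads = align(seqs)
    tails_reversed = align([s[::-1] for s in seqs])   # align the reversed sequences
    tails = [row[::-1] for row in tails_reversed]     # restore the original orientation
    return heads, tails


def column_score(heads: List[str], tails: List[str]) -> float:
    """Fraction of heads columns that reappear, residue for residue, in the tails alignment."""
    head_cols = columns(heads)
    tail_cols = set(columns(tails))
    shared = sum(1 for c in head_cols if c in tail_cols)
    return shared / len(head_cols)


def pad_right(seqs: List[str]) -> List[str]:
    """Toy stand-in for an aligner: pad every sequence with trailing gaps."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


if __name__ == "__main__":
    heads, tails = hot_alignments(["ACGT", "ACG"], pad_right)
    print(heads, tails, column_score(heads, tails))
```

With the toy aligner, sequences of unequal length receive their gaps at opposite ends in the heads and tails alignments, so the two disagree; sequences of equal length yield identical alignments. In practice `pad_right` would be replaced by a wrapper around whichever alignment program is being assessed.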
