Assessing Multiple Sequence Alignments Using Visual Tools

Bioinformatics and molecular evolutionary analyses most often start by comparing DNA or amino acid sequences through alignment. Pairwise alignment, for example, is used in BLAST similarity searches, the most widely used bioinformatics tool, to measure the similarity between a query sequence and each sequence in a database (Altschul et al., 1990; Camacho et al., 2009). The evolutionary history among sequences is reflected better when more than two sequences are aligned in a multiple sequence alignment (MSA). When building an MSA, we assume that the sequences being compared are derived from a common ancestral sequence. Building an MSA then means inferring homologous positions among the input sequences and placing gaps in the sequences so that these homologous positions are aligned. These gaps represent evolutionary events of their own: gaps (also called indels) are caused by insertions or deletions of characters (nucleotides or amino acids) on a particular lineage during evolution. Building an MSA is, therefore, an attempt to reconstruct the evolutionary history of the sequences involved. While it is easy to see that MSA quality affects the quality of phylogenetic tree reconstruction, its effect reaches far beyond this. Bioinformatics methods that use information extracted from MSAs include profile building in similarity search (e.g., PSI-BLAST: Altschul et al., 1997), motif/profile recognition (e.g., PROSITE: Hulo et al., 2008), profile hidden Markov models for protein families/domains (e.g., Pfam: Finn et al., 2010), and protein secondary-structure prediction (for review, see Pirovano & Heringa, 2010). Numerous bioinformatics and molecular evolutionary analyses are thus affected by MSA quality and benefit from reliable MSAs. Despite the importance of good MSAs, assessing MSA quality is far from straightforward.
Measuring the quality of MSAs requires two components: a benchmark dataset and a scoring method. A benchmark dataset consists of reference alignments, which are considered to represent the evolutionary history of the sequences faithfully. The same set of sequences included in a reference alignment is then aligned using each MSA method to be tested. The reconstructed MSA is compared with the reference MSA using a scoring method, and the quality of the reconstructed MSA is assessed compared to the
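
The comparison of a reconstructed MSA with a reference MSA can be illustrated with a minimal sketch. The text here does not name a particular scoring method, so the sketch assumes two commonly used ones: a sum-of-pairs score (the fraction of residue pairs aligned in the reference that are also aligned in the reconstructed MSA) and a column score (the fraction of reference columns reproduced exactly). The function names and the toy sequences are hypothetical and serve only as an illustration.

```python
from itertools import combinations

GAP = "-"


def residue_indices(msa):
    """For each sequence, map every column to a residue index (None for a gap)."""
    lengths = {len(row) for row in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    indices = {}
    for name, row in msa.items():
        counter = 0
        mapped = []
        for char in row:
            if char == GAP:
                mapped.append(None)
            else:
                mapped.append(counter)
                counter += 1
        indices[name] = mapped
    return indices


def aligned_pairs(msa):
    """Set of residue pairs (seq_a, res_a, seq_b, res_b) that share a column."""
    indices = residue_indices(msa)
    names = sorted(msa)
    n_columns = len(indices[names[0]])
    pairs = set()
    for col in range(n_columns):
        for a, b in combinations(names, 2):
            res_a, res_b = indices[a][col], indices[b][col]
            if res_a is not None and res_b is not None:
                pairs.add((a, res_a, b, res_b))
    return pairs


def alignment_columns(msa):
    """Set of columns, each described by the residue index of every sequence."""
    indices = residue_indices(msa)
    names = sorted(msa)
    n_columns = len(indices[names[0]])
    return {tuple(indices[n][col] for n in names) for col in range(n_columns)}


def check_same_sequences(reconstructed, reference):
    """Both alignments must contain the same ungapped sequences."""
    def strip(msa):
        return {name: row.replace(GAP, "") for name, row in msa.items()}
    if strip(reconstructed) != strip(reference):
        raise ValueError("alignments do not contain the same sequences")


def sum_of_pairs_score(reconstructed, reference):
    """Fraction of reference residue pairs recovered in the reconstructed MSA."""
    check_same_sequences(reconstructed, reference)
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(reconstructed)) / len(ref_pairs)


def column_score(reconstructed, reference):
    """Fraction of reference columns reproduced exactly in the reconstructed MSA."""
    check_same_sequences(reconstructed, reference)
    ref_columns = alignment_columns(reference)
    return len(ref_columns & alignment_columns(reconstructed)) / len(ref_columns)


# Toy data (hypothetical): the same three sequences aligned in two ways.
reference_msa = {"s1": "AC-GT", "s2": "ACGGT", "s3": "A--GT"}
reconstructed_msa = {"s1": "ACG-T", "s2": "ACGGT", "s3": "A-G-T"}

sp = sum_of_pairs_score(reconstructed_msa, reference_msa)
cs = column_score(reconstructed_msa, reference_msa)
print(f"sum-of-pairs score: {sp:.2f}")
print(f"column score: {cs:.2f}")
```

In this toy case the reconstructed MSA misplaces a single gap in two sequences, recovering 8 of the 10 reference residue pairs and 3 of the 5 reference columns. Both scores depend entirely on how trustworthy the reference alignment is, which is why the choice of benchmark dataset matters as much as the scoring method.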

[1]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[2]  Kazutaka Katoh,et al.  Recent developments in the MAFFT multiple sequence alignment program , 2008, Briefings Bioinform..

[3]  D R Flower,et al.  The lipocalin protein family: structural and sequence overview. , 2000, Biochimica et biophysica acta.

[4]  Lode Wyns,et al.  SABmark- a benchmark for sequence alignment that covers the entire known fold space , 2005, Bioinform..

[5]  Dennis R. Livesay,et al.  Probalign: multiple sequence alignment using partition function posterior probabilities , 2006, Bioinform..

[6]  Etsuko N. Moriyama,et al.  SuiteMSA: visual tools for multiple sequence alignment comparison and molecular sequence simulation , 2011, BMC Bioinformatics.

[7]  Nan Yu,et al.  The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs , 2002, BMC Bioinformatics.

[8]  N. Grishin,et al.  PROMALS3D: a tool for multiple protein sequence and structure alignments , 2008, Nucleic acids research.

[9]  O. Gascuel,et al.  New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. , 2010, Systematic biology.

[10]  Gajendra P. S. Raghava,et al.  OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy , 2003, BMC Bioinformatics.

[11]  Kenji Mizuguchi,et al.  HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database , 2004, Nucleic Acids Res..

[12]  Lukas Käll,et al.  A general model of G protein‐coupled receptor sequences and its application to detect remote homologs , 2006, Protein science : a publication of the Protein Society.

[13]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.

[14]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[15]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[16]  A. Löytynoja,et al.  Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis , 2008, Science.

[17]  David T. Jones,et al.  Transmembrane protein topology prediction using support vector machines , 2009, BMC Bioinformatics.

[18]  Christian J. A. Sigrist,et al.  The 20 years of PROSITE , 2007, Nucleic Acids Res..

[19]  Masami Ikeda,et al.  Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern , 2004, Comput. Biol. Chem..

[20]  Minoru Kanehisa,et al.  AAindex: amino acid index database, progress report 2008 , 2007, Nucleic Acids Res..

[21]  Michael Kaufmann,et al.  BMC Bioinformatics BioMed Central , 2005 .

[22]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[23]  Cory L. Strope,et al.  Biological Sequence Simulation for Testing Complex Evolutionary Hypotheses: indel-Seq-Gen Version 2.0 , 2009, Molecular biology and evolution.

[24]  Rodrigo Lopez,et al.  Clustal W and Clustal X version 2.0 , 2007, Bioinform..

[25]  Cédric Notredame,et al.  Upcoming challenges for multiple sequence alignment methods in the high-throughput era , 2009, Bioinform..

[26]  D. Morrison Why would phylogeneticists ignore computerized sequence alignment? , 2009, Systematic biology.

[27]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[28]  Olivier Poch,et al.  A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives , 2011, PloS one.

[29]  D. Morrison Evolution of the Apicomplexa: where are we now? , 2009, Trends in parasitology.

[30]  Robert D. Finn,et al.  The Pfam protein families database , 2004, Nucleic Acids Res..

[31]  R. Doolittle,et al.  A simple method for displaying the hydropathic character of a protein. , 1982, Journal of molecular biology.

[32]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[33]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[34]  Robert C. Edgar,et al.  Quality measures for protein alignment benchmarks , 2010, Nucleic acids research.

[35]  Ning Ma,et al.  BLAST+: architecture and applications , 2009, BMC Bioinformatics.

[36]  Melissa S. Cline,et al.  Predicting reliable regions in protein sequence alignments , 2002, Bioinform..

[37]  Jaap Heringa,et al.  PRALINETM: a strategy for improved multiple alignment of transmembrane proteins , 2008, Bioinform..

[38]  Jimin Pei,et al.  PROMALS: towards accurate multiple sequence alignments of distantly related proteins , 2007, Bioinform..

[39]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[40]  Jaap Heringa,et al.  Protein secondary structure prediction. , 2010, Methods in molecular biology.

[41]  Gert Vriend,et al.  GPCRDB information system for G protein-coupled receptors , 2003, Nucleic Acids Res..

[42]  Ari Löytynoja,et al.  An algorithm for progressive multiple alignment of sequences with insertions. , 2005, Proceedings of the National Academy of Sciences of the United States of America.