Computation and analysis of genomic multi-sequence alignments.

Multi-sequence alignments of large genomic regions are at the core of many computational genome-annotation approaches aimed at identifying coding regions, RNA genes, regulatory regions, and other functional features. Such alignments also underlie many genome-evolution studies. Here we review recent computational advances in the area of multi-sequence alignment, focusing on methods suitable for aligning whole vertebrate genomes. We introduce the key algorithmic ideas in use today, and identify publicly available resources for computing, accessing, and visualizing genomic alignments. Finally, we describe the latest alignment-based approaches to identifying and characterizing various types of functional sequences. We also identify key areas of research and suggest directions for future improvement.
