Alignathon: A competitive assessment of whole genome alignment methods

Background: Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark datasets are available for protein MSAs and, to a lesser extent, for global nucleotide MSAs, but less effort has gone into establishing benchmarks for the more general problem of whole genome alignment (WGA).

Results: Following the model of the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and assessments were performed collectively after all submissions had been received. Three datasets were used: two of simulated primate and mammalian phylogenies, and one of 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines.

Conclusions: We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in alignment quality between differently annotated regions, and found that few tools aligned the duplications analysed. Many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all datasets, submissions and assessment programs for further study, and, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
