Progressive alignment with Cactus: a multiple-genome aligner for the thousand-genome era

Cactus, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes and struggles in regions of highly duplicated sequence. We describe progressive extensions to Cactus that enable reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We demonstrate that Cactus scales to hundreds of genomes and beyond by describing results from an alignment of over 600 amniote genomes, which is, to our knowledge, the largest multiple vertebrate genome alignment yet created. Further, we show improvements in orthology resolution that lead to downstream improvements in annotation.
