Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly.

Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies.

Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly, and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Inanç Birol | Dominique Lavenier | Hamidreza Chitsaz | Sergey Koren | Nuno A. Fonseca | David Haussler | François Laviolette | Zhenyu Li | Yingrui Li | Guillaume Chapuis | Dent Earl | Benedict Paten | Hao Zhang | Rayan Chikhi | Francesco Vezzi | Yue Liu | Siu-Ming Yiu | Jacques Corbeil | Henry Song | Jay Shendure | Richard Durbin | Adam M Phillippy | Ruibang Luo | Jun Wang | Ganeshkumar Ganapathy | Tak-Wah Lam | Ted Sharpe | Fedor Tsarev | Riccardo Vicedomini | Simone Scalabrin | Alexey Sergushichev | Binghang Liu | Jianying Yuan | Yujian Shi | Sébastien Boisvert | James R. Knight | Zemin Ning | Michael C Schatz | Steve Goldstein | Shiguo Zhou | Jacob O Kitzman | Sante Gnerre | Sergey Melnikov | Shuangye Yin | Paul J Kersey | Matthias Haimel | Jared T Simpson | Scott Emrich | Michael Place | Kim C Worley | Richard A Gibbs | Anton Alexandrov | Ian F Korf | Delphine Naquin | Shaun D Jackman | Nicolas Maillet | Guojie Zhang | Isaac Y. Ho | Matthew D MacManes | Xiang Qin | David C Schwartz | T Roderick Docking | Stephen Richards | Keith R Bradnam | Joseph N Fass | Paul Baranay | Michael Bechner | Jarrod A Chapman | Wen-Chi Chou | Cristian Del Fabbro | Pavel Fedotov | Élénie Godzaridis | Giles Hall | Joseph B Hiatt | Jason Howard | Martin Hunt | David B Jaffe | Erich D Jarvis | Huaiyang Jiang | Sergey Kazakov | Iain MacCallum | Thomas D Otto | Octávio S Paulo | Francisco Pina-Martins | Dariusz Przybylski | Carson Qu | Filipe J Ribeiro | Daniel S Rokhsar | J Graham Ruby | Timothy I Shaw | Bruno M Vieira
