From Short Reads to Chromosome-Scale Genome Assemblies.

A high-quality, annotated genome assembly is the foundation for many downstream studies. However, obtaining such an assembly is a complex, iterative process that requires high-quality data and combines different approaches and data types. While some commercially available software packages incorporate multiple steps of genome assembly, they may not be flexible enough to be routinely applied to all organisms, particularly to nonmodel species such as pathogenic oomycetes and fungi. Researchers who understand and apply the most appropriate currently available tools for each step can customize parameters and optimize results for their organism of study. Based on our experience with de novo assembly and annotation of several oomycete species, this chapter provides a modular workflow that runs from processing of raw reads, through initial assembly generation, optimization, and chromosome-scale scaffolding, to annotation. For each step it outlines the input and output data and gives examples of the software used, along with alternatives. The accompanying Notes provide background information for each step as well as alternative options. The final result of this workflow can be either an annotated, high-quality, validated, chromosome-scale assembly or a draft assembly of sufficient quality to meet the specific needs of a project.
