A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning

Background: Diploid genome assembly is typically impeded by heterozygosity, which introduces errors when haplotypes are collapsed into a consensus sequence. Trio binning offers an innovative solution that exploits heterozygosity for assembly. Short parental reads are used to assign parental origin to long reads from their F1 offspring before assembly, enabling complete haplotype resolution. Trio binning could therefore provide an effective strategy for assembling highly heterozygous genomes that are traditionally problematic, such as insect genomes. These include the genome of the wood tiger moth (Arctia plantaginis), an evolutionary study system for warning colour polymorphism.

Findings: We produced a high-quality, haplotype-resolved assembly for Arctia plantaginis through trio binning. We sequenced a same-species family (F1 heterozygosity ∼1.9%) and used parental Illumina reads to bin 99.98% of offspring Pacific Biosciences reads by parental origin, before assembling each haplotype separately and scaffolding with 10X linked reads. Both assemblies are highly contiguous (mean scaffold N50: 8.2 Mb) and complete (mean BUSCO completeness: 97.3%), with complete annotations and 31 chromosomes identified through karyotyping. We used the assembly to analyse genome-wide population structure and relationships among 40 wild resequenced individuals from five populations across Europe, revealing the Georgian population as the most genetically differentiated, with the lowest genetic diversity.

Conclusions: We present the first invertebrate genome to be assembled via trio binning. This assembly is one of the highest-quality genomes available for Lepidoptera, supporting trio binning as a potent strategy for assembling highly heterozygous genomes. Using this assembly, we provide genomic insights into the geographic population structure of Arctia plantaginis.
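The binning step summarised above (short parental reads assigning long offspring reads to a parent) can be illustrated with a toy sketch. This is not the pipeline used for the assembly: the function names, the value of k, and the example sequences are all hypothetical, and the sketch assumes error-free reads on a single strand, with no reverse-complement handling and no normalisation of k-mer counts, both of which a production tool would need.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from short parental reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    # K-mers shared by both parents carry no information about origin.
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign one long offspring read to the parent whose specific k-mers it shares more of."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # tie, or no informative k-mers in the read


# Hypothetical toy data: the two parents differ at a single base.
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], K)
for read in ["CGTACGTA", "CGTTCGTA", "ACGT"]:
    print(read, bin_read(read, mat_only, pat_only, K))
```

In this toy case the first read is binned as maternal, the second as paternal, and the third stays unassigned because its only k-mer occurs in both parents. Higher heterozygosity yields more parent-specific k-mers per read, which is why the approach benefits from the variation that hampers collapsed assemblies.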
