Assessing genome assembly quality using the LTR Assembly Index (LAI)

Abstract: Assembling a plant genome is challenging because of the abundance of repetitive sequences, yet no standard is available for evaluating the assembly of repeat space. LTR retrotransposons (LTR-RTs) are the predominant interspersed repeats and are poorly assembled in draft genomes. Here, we propose a reference-free genome metric, the LTR Assembly Index (LAI), that evaluates assembly continuity using LTR-RTs. After correcting for LTR-RT amplification dynamics, we show that LAI is independent of genome size, genomic LTR-RT content, and gene space evaluation metrics (i.e., BUSCO and CEGMA). By comparing genomic sequences produced by various sequencing techniques, we reveal a significant gain in assembly continuity from long-read-based techniques over short-read-based methods. Moreover, LAI can facilitate iterative assembly improvement through assembler selection and can identify low-quality genomic regions. To apply LAI, intact LTR-RTs and total LTR-RTs should contribute at least 0.1% and 5% of the genome size, respectively. The LAI program is freely available on GitHub: https://github.com/oushujun/LTR_retriever.
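
The applicability thresholds stated in the abstract can be illustrated with a small sketch. It checks whether intact and total LTR-RTs reach 0.1% and 5% of the genome size, and then computes an uncorrected ("raw") index as the percentage of total LTR-RT sequence length that belongs to intact LTR-RTs. That ratio is an assumption here, since the abstract does not spell out the formula, and the correction for LTR-RT amplification dynamics is not modelled. The function names and the example figures are hypothetical and are not the interface or output of the LTR_retriever program.

```python
def lai_applicable(intact_bp, total_ltr_bp, genome_bp,
                   min_intact_pct=0.1, min_total_pct=5.0):
    """Return True if the genome meets the thresholds quoted in the abstract.

    intact_bp    -- total length of intact LTR-RTs (bp)
    total_ltr_bp -- total length of all LTR-RT sequence (bp)
    genome_bp    -- assembled genome size (bp)
    """
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    intact_pct = 100.0 * intact_bp / genome_bp
    total_pct = 100.0 * total_ltr_bp / genome_bp
    return intact_pct >= min_intact_pct and total_pct >= min_total_pct


def raw_lai(intact_bp, total_ltr_bp):
    """Uncorrected index: intact LTR-RT length as a percentage of total
    LTR-RT length (assumed definition; no amplification-dynamics correction)."""
    if total_ltr_bp <= 0:
        raise ValueError("total_ltr_bp must be positive")
    return 100.0 * intact_bp / total_ltr_bp


if __name__ == "__main__":
    # Hypothetical assembly: 400 Mb genome, 100 Mb of LTR-RT sequence,
    # of which 8 Mb lies in intact elements.
    genome, total_ltr, intact = 400_000_000, 100_000_000, 8_000_000
    if lai_applicable(intact, total_ltr, genome):
        print(f"raw LAI = {raw_lai(intact, total_ltr):.1f}")
    else:
        print("LTR-RT content too low for LAI to be applied")
```

In this made-up case intact LTR-RTs make up 2% of the genome and total LTR-RTs 25%, so both thresholds are met; a genome with very little LTR-RT sequence would fail the check and the index should not be interpreted.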
