Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and highly repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.

[1]  Thomas Hackl,et al.  proovread: large-scale high-accuracy PacBio correction through iterative short read consensus , 2014, Bioinform..

[2]  Pui-Yan Kwok,et al.  Rapid Genome Mapping in Nanochannel Arrays for Highly Complete and Accurate De Novo Sequence Assembly of the Complex Aegilops tauschii Genome , 2013, PloS one.

[3]  J. Messing,et al.  Analysis of tandem gene copies in maize chromosomal regions reconstructed from long sequence reads , 2016, Proceedings of the National Academy of Sciences.

[4]  J. Batley,et al.  A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome , 2014, Science.

[5]  Damon Lisch,et al.  How important are transposons for plant evolution? , 2012, Nature Reviews Genetics.

[6]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[7]  Yadan Luo,et al.  Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation , 2013, Nature.

[8]  Jian Wang,et al.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler , 2012, GigaScience.

[9]  Sergey Koren,et al.  Aggressive assembly of pyrosequencing reads with mates , 2008, Bioinform..

[10]  Jan Vrána,et al.  BioNano genome mapping of individual chromosomes supports physical mapping and sequence assembly in complex plant genomes , 2016, Plant biotechnology journal.

[11]  P. Kwok,et al.  Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly , 2012, Nature Biotechnology.

[12]  Heng Li,et al.  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences , 2015, Bioinform..

[13]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[14]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[15]  S. Salzberg,et al.  Fast algorithms for large-scale genome alignment and comparison. , 2002, Nucleic acids research.

[16]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[17]  M. Schatz,et al.  Hybrid error correction and de novo assembly of single-molecule sequencing reads , 2012, Nature Biotechnology.

[18]  S. Salzberg,et al.  Versatile and open software for comparing large genomes , 2004, Genome Biology.

[19]  Michael Roberts,et al.  The MaSuRCA genome assembler , 2013, Bioinform..