Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
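The core idea behind overlapping reads with MinHash can be illustrated with a minimal sketch. It is not the MHAP implementation: the function names, the k-mer size `k=16`, the sketch size `num_hashes=256`, and the use of seeded BLAKE2b hashes are all assumptions chosen for illustration. Each read is reduced to a fixed-length sketch holding the minimum hash value of its k-mers under each hash function, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hypothetical seeded 64-bit hash; stands in for one hash function of a family."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=16, num_hashes=256):
    """Build a MinHash sketch: the minimum hash over all k-mers, per hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=16):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(3000))
    read_a = genome[0:2000]      # two error-free reads sharing 1,000 bases
    read_b = genome[1000:3000]
    unrelated = "".join(rng.choice("ACGT") for _ in range(2000))

    sk_a, sk_b, sk_u = (minhash_sketch(r) for r in (read_a, read_b, unrelated))
    print("estimated J(a, b):", estimate_jaccard(sk_a, sk_b))
    print("exact J(a, b):    ", exact_jaccard(read_a, read_b))
    print("estimated J(a, unrelated):", estimate_jaccard(sk_a, sk_u))
```

Comparing two sketches takes time proportional to the sketch length rather than the read length, which is what makes all-against-all overlap detection tractable. The example uses error-free reads; sequencing errors destroy shared k-mers and lower the observed similarity, so a practical overlapper would need a smaller k or a more permissive threshold than this sketch suggests.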
