Spaced Seed Data Structures for De Novo Assembly

De novo assembly of a species' genome is essential when no reference genome sequence is available. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes: a table of subsequences of a fixed length is derived from the reads, and the overlaps between those subsequences are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicality of DBG relative to other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address the memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates the two subsequences of a seed pair, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable for storing spaced seeds implicitly and would be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assembly, and in read error correction.
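
As an illustration of the idea of storing paired-subsequence spaced seeds implicitly in a Bloom filter, the following is a minimal Python sketch. It is not the authors' implementation: the names `spaced_seeds`, `BloomFilter` and `seed_key`, the parameters `k`, `gap`, `num_bits` and `num_hashes`, and the use of salted BLAKE2b hashes are all assumptions made for this example.

```python
import hashlib


def spaced_seeds(seq, k, gap):
    """Yield (left, right) k-mer pairs whose halves are `gap` bases apart.

    Hypothetical helper: each seed spans 2*k + gap bases of the read, and
    the bases inside the gap are ignored.
    """
    span = 2 * k + gap
    for i in range(len(seq) - span + 1):
        yield seq[i:i + k], seq[i + k + gap:i + span]


def seed_key(left, right):
    """Join the two halves into one string key (an assumed encoding)."""
    return left + ":" + right


class BloomFilter:
    """Plain Bloom filter over strings; membership only, no counts."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function (illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item)
        )


# Usage: index every spaced seed of a toy read, then query one of them.
read = "ACGTACGTAC"
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for left, right in spaced_seeds(read, k=2, gap=3):
    bf.add(seed_key(left, right))

print(seed_key("AC", "CG") in bf)  # an inserted seed is always reported present
```

A Bloom filter never reports an inserted seed as absent, but it can report a seed that was never inserted as present, so the filter size and number of hash functions would need to be chosen for the expected number of seeds. Tracking seed frequencies, as the abstract describes, would need counters rather than single bits; that extension is not shown here.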
