When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. These archives are growing faster than their content can be processed, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have proven useful for tackling this flood of genomic data with minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can use them effectively. Interactive workbooks that explain the concepts and demonstrate workflows are available at https://github.com/will-rowe/genome-sketching.
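
As a rough illustration of what a "small, approximate summary" can look like, the sketch below builds a bottom-k MinHash sketch of a DNA string and estimates the Jaccard similarity of two sequences from their sketches alone. It is a minimal sketch under stated assumptions, not the method of any particular tool: the function names, the k-mer length, the sketch size, and the use of BLAKE2b as the hash are all illustrative choices, and reverse complements (canonical k-mers) are ignored for brevity.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary, assumed choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=5, size=16):
    """Keep only the `size` smallest distinct k-mer hashes as the summary."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them are present in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = "ACGTACGTTAGCCGATAGGCTTAACGGATCCGATTACA"
    b = "ACGTACGTTAGCCGATAGGCTTAACGGATCCGTTTACA"  # one substitution relative to a
    sa, sb = bottom_k_sketch(a), bottom_k_sketch(b)
    print(f"sketch sizes: {len(sa)}, {len(sb)}")
    print(f"estimated Jaccard similarity: {jaccard_estimate(sa, sb):.2f}")
```

Each sketch holds at most `size` integers regardless of how long the input sequence is, which is where the saving in memory and compute comes from; the price is that the similarity is an estimate whose accuracy depends on the sketch size.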
