Paper information - When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching.
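To make "small, approximate summaries" concrete, the following is a minimal sketch of one common approach, bottom-k MinHash, written for illustration and not taken from the article or its workbooks. The function names, the k-mer length, the sketch size, and the choice of BLAKE2b as the hash are all assumptions. It also skips canonical k-mers (reverse complements), which genomic tools normally handle.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash (assumed choice: BLAKE2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=5, size=16):
    """Keep only the `size` smallest distinct k-mer hashes as the sketch."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity of the underlying k-mer sets from two sketches."""
    # The `size` smallest hashes of the union form a sketch of the union.
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    # Fraction of the union sketch that appears in both input sketches.
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
    b = "ACGTACGTTGCAACGTAGCTAGCTAGGTTCCA"
    print(jaccard_estimate(bottom_k_sketch(a), bottom_k_sketch(b)))
```

Each sequence is reduced to at most `size` integers regardless of its length, and similarity is then estimated from the sketches alone. A larger sketch costs more memory but lowers the variance of the estimate.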

Will P. M. Rowe
