Streaming histogram sketching for rapid microbiome analytics

Motivation The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, the augmentation of novel datasets and the reanalysis of published works. This volume of data, together with the widespread proliferation of microbiome research and the looming era of clinical metagenomics, creates an urgent need for analytics that can process very large datasets quickly. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real-time.
Results We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that efficiently represents a microbiome k-mer spectrum. Using public microbiome datasets, we show that histosketches can be clustered by sample type using pairwise Jaccard similarity estimation, which in turn allows for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we show that histosketches can be used to train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a Random Forest Classifier that could accurately predict whether the neonate had received antibiotic treatment (95% accuracy, 97% precision) and could subsequently be used to classify microbiome data streams in less than 12 seconds. We provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 seconds on a standard laptop using 4 cores, with the sketch occupying 3000 bytes of disk space.
Availability Our implementation (HULK) is written in Go and is available at: https://github.com/will-rowe/hulk (MIT License)
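To make the idea of a similarity-preserving sketch of a k-mer spectrum more concrete, the following is a minimal Python illustration. It is not the HULK implementation and it does not reproduce the consistent weighted sampling scheme of HistoSketch: it swaps in a simpler weighted "exponential race" sampler, in which each sketch slot keeps the k-mer that minimises -ln(u)/count for a hash-derived uniform u. The function names (`kmer_spectrum`, `sketch`, `similarity`), the choice of BLAKE2b as the hash, and the default k-mer and sketch sizes are all assumptions made for this example. The fraction of matching slots gives a weighted similarity estimate that is related to, but not identical to, the Jaccard estimate described in the abstract.

```python
import hashlib
import math
from collections import Counter


def kmer_spectrum(reads, k=7):
    """Count every k-mer of length k across an iterable of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def _uniform(kmer, slot):
    """Deterministic pseudo-random value in (0, 1] for a (k-mer, slot) pair."""
    digest = hashlib.blake2b(f"{slot}:{kmer}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


def sketch(spectrum, size=64):
    """Fixed-size sketch: each slot keeps the k-mer that wins a weighted race.

    Assumption: a simplified exponential-race sampler, not HistoSketch itself.
    The spectrum must contain at least one k-mer.
    """
    slots = []
    for slot in range(size):
        # Frequent k-mers have smaller expected race times, so they win more often.
        winner = min(
            spectrum,
            key=lambda kmer: -math.log(_uniform(kmer, slot)) / spectrum[kmer],
        )
        slots.append(winner)
    return slots


def similarity(sketch_a, sketch_b):
    """Fraction of slots on which two equal-length sketches agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    sample_a = sketch(kmer_spectrum(["ACGTACGTTGCA", "TTGCAACGTAAC"], k=4))
    sample_b = sketch(kmer_spectrum(["ACGTACGTTGCA", "GGGTTTCCCAAA"], k=4))
    print(similarity(sample_a, sample_b))
```

Because the sketch has a fixed number of slots regardless of input size, two samples can be compared in time proportional to the sketch size rather than to the number of reads. In this simplified version the whole spectrum is built before sketching; a streaming variant would update the slots incrementally as counts change.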
