Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons

Abstract

Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.

Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.

Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
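
To make the idea of a k-mer abundance comparison concrete, the following is a minimal single-machine sketch in Python. It is not Libra's implementation: the function names (`kmer_counts`, `cosine_similarity`, `all_vs_all`), the toy reads, and the choice of k are illustrative assumptions, and the sketch omits the Hadoop distribution and any abundance weighting the tool may apply. It only illustrates why cosine similarity on k-mer counts is insensitive to sequencing depth: scaling every count in a sample by the same factor leaves the angle between the two abundance vectors unchanged.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse k-mer abundance vectors."""
    # Only k-mers present in both samples contribute to the dot product.
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sample shares nothing with any other sample
    return dot / (norm_a * norm_b)


def all_vs_all(samples, k):
    """Return a dict mapping (sample_x, sample_y) to their cosine similarity."""
    profiles = {name: kmer_counts(reads, k) for name, reads in samples.items()}
    names = sorted(profiles)
    return {
        (x, y): cosine_similarity(profiles[x], profiles[y])
        for x in names
        for y in names
    }


# Hypothetical toy samples: "deep" is "shallow" sequenced at twice the depth.
samples = {
    "shallow": ["ACGTACGT"],
    "deep": ["ACGTACGT", "ACGTACGT"],
    "other": ["TTTTTTTT"],
}
scores = all_vs_all(samples, k=4)
```

In this toy input, the "shallow" and "deep" samples have proportional k-mer counts, so their similarity is 1 up to floating-point error, while "other" shares no 4-mers with either and scores 0.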
