Paper information - Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons

Abstract

Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.

Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.

Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
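To make the idea of comparing samples by k-mer abundance with cosine similarity concrete, here is a minimal in-memory sketch in Python. It is not Libra's implementation: Libra runs on Hadoop, and any weighting or filtering it applies to k-mer counts is not described in the abstract. The function names (`kmer_counts`, `cosine_similarity`, `all_vs_all`), the toy reads, and the choice of k = 3 are all assumptions made for illustration. The sketch shows one property the abstract relies on: cosine similarity depends on the direction of the abundance vector, not its length, so sequencing the same community twice as deeply leaves the score unchanged.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse k-mer abundance vectors."""
    # Only k-mers present in both samples contribute to the dot product.
    dot = sum(count * b[kmer] for kmer, count in a.items() if kmer in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sample is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def all_vs_all(samples, k=3):
    """Return cosine similarity for every ordered pair of samples."""
    profiles = {name: kmer_counts(reads, k) for name, reads in samples.items()}
    names = sorted(profiles)
    return {
        (x, y): cosine_similarity(profiles[x], profiles[y])
        for x in names
        for y in names
    }


if __name__ == "__main__":
    # Hypothetical toy samples: "A_deep" is "A" sequenced at twice the depth.
    toy = {
        "A": ["ACGTACGT", "TTGACA"],
        "A_deep": ["ACGTACGT", "TTGACA", "ACGTACGT", "TTGACA"],
        "B": ["GGGCCC", "GGGCCA"],
    }
    for pair, score in all_vs_all(toy).items():
        print(pair, round(score, 3))
```

In this toy input, "A" and "A_deep" score as essentially identical despite the difference in read count, while "A" and "B" share no 3-mers and score zero. A distance for clustering could be derived as one minus the similarity, though that conversion is also an assumption here.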
