Deconvolute individual genomes from metagenome sequences through read clustering

Motivation: Metagenome assembly from short next-generation sequencing reads is challenging because of the scale of the data and the computational complexity of the task. Clustering short reads before assembly makes it possible to assemble genomes in parallel downstream, with optimization tailored to each genome. However, current read clustering methods suffer from either false negatives (under-clustering) or false positives (over-clustering).

Results: We build on SpaRC, a previously developed scalable read clustering method on Apache Spark that has very low false positives, and extend it with a new method that further clusters small clusters. This method exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using a synthetic dataset from mouse gut microbiomes, we show that this method has the potential to cluster almost all of the reads from genomes with sufficient sequencing coverage. We also explored several clustering parameters that differentially affect genomes with different sequencing coverage.

Availability: https://bitbucket.org/berkeleylab/jgi-sparc/

Contact: zhongwang@lbl.gov
