Deconvolute individual genomes from metagenome sequences through short read clustering

Assembling metagenomes from short next-generation sequencing reads is challenging because of the scale of the data and the computational complexity of the task. Clustering short reads by species before assembly makes it possible to assemble genomes in parallel downstream, with optimization tailored to each genome. However, current read clustering methods suffer from either false negatives (under-clustering) or false positives (over-clustering). Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce under-clustering. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn improves the quality of downstream genome assembly.
