Binning microbial genomes using deep learning

Identifying and reconstructing microbial species from metagenomic sequencing data is an important and challenging task. Existing approaches rely on gene or contig co-abundance information across multiple samples and on the k-mer composition of the sequences. Here we use recent advances in deep learning to develop an algorithm that uses variational autoencoders to encode co-abundance and compositional information prior to clustering. We show that the deep network is able to integrate these two heterogeneous datasets without any prior knowledge, and that our method outperforms the existing state of the art by reconstructing 1.8 to 8 times more highly precise and complete genome bins from three different benchmark datasets. Additionally, we apply our method to a gene catalogue of almost 10 million genes and 1,270 samples from the human gut microbiome. There, we are able to cluster 1.3 to 1.8 million additional genes and reconstruct 117 to 246 more highly precise and complete bins than previous methods, of which 70 bins are completely new. Our method, Variational Autoencoders for Metagenomic Binning (VAMB), is freely available at: https://github.com/jakobnissen/vamb
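To make the two input types concrete, the following is a minimal sketch of how a per-contig feature vector combining co-abundance and k-mer composition might be assembled before being passed to an encoder. It is not taken from the VAMB code base: the function names, the choice of k = 4, and the sum-to-one normalization are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Hypothetical helper; k=4 is an assumed default, not a value from the abstract.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with ambiguous bases (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def normalize_abundance(depths):
    """Scale per-sample read depths so they sum to 1 (assumed normalization)."""
    total = sum(depths)
    return [d / total for d in depths] if total else [0.0] * len(depths)


def contig_features(seq, depths, k=4):
    """Concatenate the co-abundance profile and the k-mer composition of one contig."""
    return normalize_abundance(depths) + kmer_frequencies(seq, k)


# Toy contig observed in three samples (made-up values).
features = contig_features("ACGTACGTAC", [10.0, 30.0, 60.0])
print(len(features))  # 3 abundance values + 4**4 k-mer frequencies
```

In a pipeline of the kind the abstract describes, vectors like `features` would be encoded by the variational autoencoder, and the resulting latent representation would be clustered into bins.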
