DeepMAsED: Evaluating the quality of metagenomic assemblies

Motivation/background: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.

Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED's accuracy substantially exceeds the state of the art when applied to large and complex metagenome assemblies. Our model estimates a contig misassembly rate of close to 5% in two recent large-scale metagenome assembly publications.

Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modelling assumptions. Running DeepMAsED is straightforward, as is re-training the model with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.

Availability: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.

Mateo Rojas-Carulla | Ruth E. Ley | Nicholas D. Youngblut | Bernhard Schoelkopf
