Identifying viruses from metagenomic data using deep learning

Background: Metagenomic sequencing makes it possible to sequence microbial genomes, including viral genomes, at massive scale without laboratory culture. Existing reference-based and gene-homology-based methods are inefficient at identifying unknown viruses or short viral sequences in metagenomic data.

Methods: We developed DeepVirFinder, a reference-free and alignment-free machine learning method that uses deep learning to identify viral sequences in metagenomic data.

Results: Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved accuracy for under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were associated with cancer status, suggesting that viruses may play important roles in CRC.

Conclusions: Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
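The evaluation protocol described above (train on sequences discovered before a cutoff date, test on later ones, and report AUROC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `temporal_split` and `auroc`, the record layout, and the exact cutoff day of 1 May 2015 are assumptions made for illustration, and the classifier that produces the scores is left out entirely.

```python
from datetime import date

# Assumed cutoff: the abstract says "before May 2015"; the exact day is a guess.
CUTOFF = date(2015, 5, 1)


def temporal_split(records, cutoff=CUTOFF):
    """Split (sequence_id, discovery_date) records into train and test sets.

    Records dated before the cutoff go to training; the rest go to testing.
    """
    train = [r for r in records if r[1] < cutoff]
    test = [r for r in records if r[1] >= cutoff]
    return train, test


def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney U).

    scores: predicted viral scores, higher means more virus-like.
    labels: 1 for viral sequences, 0 for non-viral sequences.
    Tied positive/negative pairs count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of sequences, so it suits small checks; for large test sets a sorted-rank implementation or an established library routine would be the more practical choice.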
