Viral Genome Deep Classifier

The task of virus classification into subtypes is an important concern in many categorization studies, e.g., in virology or epidemiology. Therefore, the problem of virus subtyping has been a subject of considerable interest in the last decade. Although several virus subtyping tools exist, they are often dedicated to a specific family of viruses. Even specialized methods, however, often fail to correctly subtype viruses such as HIV or influenza. To address these shortcomings, we present a viral genome deep classifier (VGDC), a tool for automatic virus subtyping that employs a deep convolutional neural network (CNN). The method is universal and can be applied to subtype any virus, as confirmed by experiments on dengue, hepatitis B and C, HIV-1, and influenza A datasets. For all considered virus types, classification performance is very high, with F1-scores ranging from about 0.85 to 1.00 depending on the virus type and the number of subtypes considered. For HIV-1 and influenza A, the VGDC significantly outperforms the leading competitors, including CASTOR and COMET. The VGDC source code is freely available to facilitate easy usage and comparison with future approaches.
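As a point of reference for the F1-scores quoted above, the following is a minimal sketch of how a per-subtype and macro-averaged F1-score can be computed from true and predicted subtype labels. It is not taken from the VGDC source code: the function names `per_class_f1` and `macro_f1` and the example labels are hypothetical, and the abstract does not state which averaging scheme was used, so macro averaging is an assumption made here for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} from paired true and predicted subtype labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for subtype t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true subtype but was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-subtype F1-scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical subtype labels, for illustration only (not data from the paper).
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "C"]

if __name__ == "__main__":
    print(per_class_f1(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
```

Macro averaging weights every subtype equally, which matters when subtypes are unevenly represented; a sample-weighted average would instead be dominated by the most frequent subtypes.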
