A machine learning approach for viral genome classification

Background: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied families of viruses. Comparative viral genomics could therefore benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.

Results: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It then uses two metrics computed from these digestions to construct the feature vectors that the machine learning algorithms use in the classification step. We benchmark CASTOR on the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows competitive performance compared to well-known HIV-1 specific classifiers (REGA and COMET) on whole genomes and pol fragments.

Conclusion: The performance, genericity and robustness of CASTOR could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
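The in silico digestion step described in the Results can be illustrated with a minimal Python sketch. It is not the CASTOR implementation. The abstract does not name the two metrics, so fragment count and root-mean-square fragment length are used here purely as assumptions. The three-enzyme panel, the function names `digest` and `features`, and the simplification of cutting at the start of each recognition site on a linear sequence are likewise hypothetical choices made for illustration.

```python
import math
import re

# Hypothetical enzyme panel: name -> recognition site.
# Simplification (assumption): the cut is placed at the start of the site.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def digest(sequence, site):
    """Return fragment lengths of a linear sequence cut at every site start."""
    # A lookahead pattern finds overlapping occurrences as well.
    cuts = [m.start() for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    # Drop empty fragments (e.g. a site located at position 0).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def features(sequence):
    """Build a feature vector with two illustrative metrics per enzyme."""
    sequence = sequence.upper()
    vector = []
    for name in sorted(ENZYMES):  # fixed enzyme order keeps vectors comparable
        fragments = digest(sequence, ENZYMES[name])
        if not fragments:  # empty input sequence
            vector.extend([0, 0.0])
            continue
        # Assumed metric 1: number of fragments produced by the enzyme.
        count = len(fragments)
        # Assumed metric 2: root mean square of the fragment lengths.
        rms = math.sqrt(sum(f * f for f in fragments) / count)
        vector.extend([count, rms])
    return vector


if __name__ == "__main__":
    toy_genome = "AAAGAATTCCCCGGATCCTT"  # made-up sequence, not real viral data
    print(digest(toy_genome, ENZYMES["EcoRI"]))
    print(features(toy_genome))
```

Vectors built this way for a set of labelled genomes could then be passed to any standard classifier. Sorting the enzyme names fixes the feature order, so that vectors from different genomes line up column by column.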
