An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes

For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and many classification algorithms are now available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and it makes it impossible to determine how classifications are made or to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source, supervised, alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of its subtype classification accuracy and performance in comparison with four state-of-the-art programs. On our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency, and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses, including dengue, influenza A, and hepatitis B and C viruses.
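As a rough illustration of what supervised, alignment-free classification on k-mer frequencies can look like, the following minimal sketch turns each sequence into a normalized k-mer frequency vector and assigns a query to the subtype with the nearest mean vector. It is not the Kameris implementation: the function names, the nearest-centroid classifier, the choice of k = 3, and the toy sequences and labels are all assumptions made for this example.

```python
from itertools import product
from math import dist


def kmer_frequencies(seq, k=3):
    """Return the normalized counts of every A/C/G/T k-mer in seq.

    The vector has 4**k entries in lexicographic k-mer order.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with ambiguous bases (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def train_centroids(labelled, k=3):
    """Average the k-mer vectors of each label.

    labelled is an iterable of (sequence, subtype) pairs (hypothetical input format).
    """
    sums, sizes = {}, {}
    for seq, label in labelled:
        vec = kmer_frequencies(seq, k)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            sizes[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        sizes[label] += 1
    return {label: [v / sizes[label] for v in s] for label, s in sums.items()}


def classify(seq, centroids, k=3):
    """Return the label whose centroid is closest in Euclidean distance."""
    vec = kmer_frequencies(seq, k)
    return min(centroids, key=lambda label: dist(vec, centroids[label]))


# Toy training data; the sequences and the labels "X" and "Y" are invented.
training = [
    ("ACACACACACAC", "X"),
    ("ACACACACAAAC", "X"),
    ("GTGTGTGTGTGT", "Y"),
    ("GTGTGTGGGTGT", "Y"),
]
centroids = train_centroids(training)
print(classify("ACACACACACCA", centroids))
print(classify("TGTGTGTGTGTG", centroids))
```

No alignment step is involved: every sequence, whatever its length, maps to a fixed-length vector, which is one reason this family of methods can process sequences quickly and run entirely on a local machine.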
