Vorpal: A novel RNA virus feature-extraction algorithm demonstrated through interpretable genotype-to-phenotype linear models

In the analysis of genomic sequence data, so-called “alignment-free” approaches are often selected for their speed relative to alignment-based approaches, especially for distance comparisons and taxonomic classification [1,2,3,4]. These methods typically rely on excising K-length substrings of the input sequence, called K-mers [5]. In machine learning, K-mer-based feature vectors have been used in applications ranging from amplicon sequencing classification to predictive modeling of antimicrobial resistance genes [6,7,8]. This is analogous to the “bag-of-words” model successfully employed in natural language processing and computer vision for document and image classification [9,10]. Feature extraction techniques from natural language processing have previously been applied by analogy to genomics data [11]; however, the “bag-of-words” approach is brittle in the RNA virus space because of high intersequence variance and the exact-matching requirement of K-mers. To reconcile the simplicity of “bag-of-words” methods with the intrinsic variance of RNA virus space, we devised a method that resolves the fragility of extracted K-mers in a way that faithfully reflects an underlying biological phenomenon. Our algorithm, Vorpal, allows the construction of interpretable linear models with clustered, representative “degenerate” K-mers as the input vector and, through regularization, sparse predictors of binary phenotypes as the output. Here, we demonstrate the utility of Vorpal by identifying nucleotide-level genomic motif predictors for binary phenotypes in three separate RNA virus clades: human pathogen vs. non-human pathogen in Orthocoronavirinae, hemorrhagic fever causing vs. non-hemorrhagic fever causing in Ebolavirus, and human host vs. non-human host in Influenza A. We discuss the capacity of this approach for in silico identification of hypotheses that can be validated by direct experimentation, as well as for identification of genomic targets for preemptive biosurveillance of emerging viruses. The code is available for download at https://github.com/mriglobal/vorpal.
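
The brittleness of exact-match K-mers, and the way a degenerate K-mer can absorb it, can be illustrated with a minimal Python sketch. This is not Vorpal's implementation: the function names `kmer_counts` and `degenerate_kmer`, the toy sequences, and the hand-picked cluster are assumptions made for illustration, and the clustering step that would group similar K-mers is omitted.

```python
from collections import Counter

# IUPAC nucleotide codes, keyed by the set of bases each code stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def kmer_counts(sequence, k):
    """Count every overlapping K-length substring (a bag of K-mers)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def degenerate_kmer(cluster):
    """Collapse equal-length K-mers into one IUPAC-coded representative.

    Hypothetical helper: the cluster is assumed to be given; how K-mers
    are grouped is outside the scope of this sketch.
    """
    return "".join(IUPAC[frozenset(column)] for column in zip(*cluster))


# Two toy sequences that differ by a single substitution (G -> A).
seq_a = "ATGGCGTAC"
seq_b = "ATGGCATAC"
k = 5

counts_a = kmer_counts(seq_a, k)
counts_b = kmer_counts(seq_b, k)

# Exact matching: every K-mer that overlaps the substitution is lost.
shared = set(counts_a) & set(counts_b)
print(shared)  # {'ATGGC'}

# A degenerate K-mer covers both variants at the variable position.
print(degenerate_kmer(["TGGCG", "TGGCA"]))  # TGGCR
```

With K = 5, a single substitution removes four of the five K-mers from the shared vocabulary, whereas the degenerate form `TGGCR` matches both variants at that position.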

[1] Yan Li, et al. Human infection with a triple-reassortant swine influenza A(H1N1) virus containing the hemagglutinin and neuraminidase genes of seasonal influenza virus, 2010, The Journal of infectious diseases.

[2] Jonas S. Almeida, et al. Alignment-free sequence comparison: benefits, applications, and tools, 2017, Genome Biology.

[3] B. Crossett, et al. Site-specific glycosylation profile of influenza A (H1N1) hemagglutinin through tandem mass spectrometry, 2018, Human vaccines & immunotherapeutics.

[4] Chih-Jen Lin, et al. LIBLINEAR: A Library for Large Linear Classification, 2008, J. Mach. Learn. Res.

[5] Shuicheng Yan, et al. Local Word Bag Model for Text Categorization, 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[6] Ryan Cotterell, et al. An Analysis of Lemmatization on Topic Models of Morphologically Rich Language, 2016.

[7] Mike Mikailov, et al. A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection, 2018, mSphere.

[8] J. Tiedje, et al. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy, 2007, Applied and Environmental Microbiology.

[9] M. Pema, et al. Codon Optimization Leads to Functional Impairment of RD114-TR Envelope Glycoprotein, 2017, Molecular therapy. Methods & clinical development.

[10] Dominique Lavenier, et al. DSK: k-mer counting with very low memory usage, 2013, Bioinform.

[11] Yoram Singer, et al. Pegasos: primal estimated sub-gradient solver for SVM, 2007, ICML '07.

[12] I. Nookaew, et al. Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer, 2017, Scientific Reports.

[13] K. Subbarao, et al. Mammalian Adaptation in the PB2 Gene of Avian H5N1 Influenza Virus, 2013, Journal of Virology.

[14] M. Nelson, et al. Transmission dynamics of pandemic influenza A(H1N1)pdm09 virus in humans and swine in backyard farms in Tumbes, Peru, 2015, Influenza and other respiratory viruses.

[15] Christopher D. Manning, et al. Introduction to Information Retrieval, 2010, J. Assoc. Inf. Sci. Technol.

[16] E. Domingo, et al. Viral Quasispecies Evolution, 2012, Microbiology and Molecular Biology Reviews.

[17] Patricia L. Clark, et al. Rare Codons Cluster, 2008, PloS one.

[18] N. Sriwilaijaroen, et al. Molecular basis of the structure and function of H1 hemagglutinin of influenza virus, 2012, Proceedings of the Japan Academy. Series B, Physical and biological sciences.

[19] Toshimichi Ikemura, et al. Codon usage tabulated from international DNA sequence databases: status for the year 2000, 2000, Nucleic Acids Res.

[20] S. Kurtz, et al. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes, 2008, BMC Genomics.

[21] Daniel Müllner, et al. Modern hierarchical, agglomerative clustering algorithms, 2011, ArXiv.

[22] Magnus Sahlgren, et al. The Distributional Hypothesis, 2008.

[23] Brian D. Ondov, et al. Mash: fast genome and metagenome distance estimation using MinHash, 2015, Genome Biology.

[24] Geoffrey E. Hinton, et al. Visualizing Data using t-SNE, 2008.

[25] Martin Michaelis, et al. Is the Bombali virus pathogenic in humans?, 2019, Bioinform.

[26] Peter L. Williams, et al. Skip the Alignment: Degenerate, Multiplex Primer and Probe Design Using K-mer Matching Instead of Alignments, 2012, PloS one.

[27] Ning Wang, et al. Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus, 2017, PLoS pathogens.

[28] Olga B. Jonas, et al. Do we need a Global Virome Project?, 2019, The Lancet Global Health.

[29] Sophia Ananiadou, et al. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty, 2009, ACL.

[30] Daniel Falush, et al. Metapalette: A k-Mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation, 2016.

[31] Graham M. West, et al. Mass Spectrometry Approach and ELISA Reveal the Effect of Codon Optimization on N-Linked Glycosylation of HIV-1 gp120, 2014, Journal of proteome research.

[32] F. Raymond, et al. Phenetic Comparison of Prokaryotic Genomes Using k-mers, 2017, Molecular biology and evolution.

[33] Pietro Perona, et al. A Bayesian hierarchical model for learning natural scene categories, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[34] H. Feldmann, et al. An Upstream Open Reading Frame Modulates Ebola Virus Polymerase Translation and Virus Replication, 2013, PLoS pathogens.

[35] Amanda Balish, et al. Triple-reassortant swine influenza A (H1) in humans in the United States, 2005-2009, 2009, The New England journal of medicine.

[36] Alejandro A. Schäffer, et al. Virus Variation Resource – improved response to emergent viral outbreaks, 2016, Nucleic Acids Res.

[37] Vineet K. Sharma, et al. 16S Classifier: A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets, 2015, PloS one.

[38] Daniel Müllner, et al. fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python, 2013.

[39] François Laviolette, et al. Interpretable genotype-to-phenotype classifiers with performance guarantees, 2018, Scientific Reports.

[40] C. Carlson, et al. Global estimates of mammalian viral diversity accounting for host sharing, 2019, Nature Ecology & Evolution.

[41] M. Saijo, et al. Genome structure of Ebola virus subtype Reston: differences among Ebola subtypes, 2001, Archives of Virology.

[42] Ehsaneddin Asgari, et al. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics, 2015, PloS one.

[43] L. Kit, et al. A revision of the system of nomenclature for influenza viruses: a WHO memorandum, 1980, Bulletin of the World Health Organization.

[44] Daniel Falush, et al. MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation, 2016, mSystems.

[45] Léon Bottou, et al. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.

[46] Laurens van der Maaten, et al. Accelerating t-SNE using tree-based algorithms, 2014, J. Mach. Learn. Res.

[47] S. Bornstein, et al. MERS and the dromedary camel trade between Africa and the Middle East, 2016, Tropical Animal Health and Production.

[48] Y. Guan, et al. SARS-CoV Infection in a Restaurant from Palm Civet, 2005, Emerging infectious diseases.